Message-ID: <CACePvbVH0ovOcBqCN7kJ3n0QFmvuf+_5tMeRXs-JAQ+m5fdoCg@mail.gmail.com>
Date: Mon, 9 Feb 2026 02:03:18 -0800
From: Chris Li <chrisl@...nel.org>
To: Viacheslav Dubeyko <Slava.Dubeyko@....com>
Cc: "lsf-pc@...ts.linux-foundation.org" <lsf-pc@...ts.linux-foundation.org>,
Viacheslav Dubeyko <vdubeyko@...hat.com>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
Pavan Rallabhandi <Pavan.Rallabhandi@....com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
Chris Mason <clm@...a.com>
Subject: Re: [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel
On Fri, Feb 6, 2026 at 11:38 AM Viacheslav Dubeyko
<Slava.Dubeyko@....com> wrote:
>
> Hello,
>
> Machine Learning (ML) is an approach/area of learning from data,
> finding patterns, and making predictions without developers implementing
> the algorithms explicitly. The number of areas where ML is applied grows
> every day. Generally speaking, ML can introduce a self-evolving and
> self-learning capability into the Linux kernel. There are already
> research works and industry efforts to employ ML approaches for
> configuring and optimizing the Linux kernel. However, introducing ML
> approaches into the Linux kernel is not simple or straightforward. There
> are multiple problems and unanswered questions on this road. First of
> all, any ML model requires floating-point (FPU) operations to run, but
> there is no direct use of the FPU in kernel space. Also, an ML model
> requires a training phase that can cause significant performance
> degradation of the Linux kernel. Even the inference phase could be
> problematic from the performance point of view on the kernel side. Using
> ML approaches in the Linux kernel is an inevitable step. But how can we
> use ML approaches in the Linux kernel? Which infrastructure do we need
> to adopt ML models in the Linux kernel?
I think there are two different things here; I believe you want the
latter, but I am not sure:
1) Using an ML model to help kernel development: code reviews, generating
patches from descriptions, etc. For example, Chris Mason has a kernel
review repo on github and he is sharing his review findings on the
mailing list:
https://github.com/masoncl/review-prompts/tree/main
It is kernel development related, but the ML agent code runs in user
space. The actual ML computation might run on GPUs/TPUs. That does not
seem to be what you have in mind.
2) Running the ML model computation in kernel space.
Can you clarify whether this is what you have in mind? You mention FPU
usage in the kernel for the ML model. That is only relevant if you need
to run the floating-point math in kernel CPU instructions. Most ML
computations do not run as CPU instructions; they run on GPUs/TPUs. Why
not keep the ML program (PyTorch/agents) in user space and pass the data
to the GPU/TPU driver to run? There will be some kernel infrastructure
like VFIO/IOMMU involved with the GPU/TPU driver. For the most part the
kernel is just facilitating the data passing to/from the GPU/TPU driver
and then to the GPU/TPU hardware. The ML hardware is doing the heavy
lifting.
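
Just for reference, the FPU concern only matters in the narrow case
where the floating-point math really does run on the CPU inside the
kernel. A minimal sketch of that case (x86-specific; the function name
is made up, and such a file would also need special Makefile CFLAGS,
since normal kernel code is built with FP/SSE disabled):

#include <linux/kernel.h>
#include <asm/fpu/api.h>

/* Hypothetical helper: tiny dot product run on the CPU in kernel mode. */
static void tiny_inference_on_cpu(const float *w, const float *x,
				  float *out, int n)
{
	int i;

	kernel_fpu_begin();	/* save FP/SIMD state before touching it */
	*out = 0.0f;
	for (i = 0; i < n; i++)
		*out += w[i] * x[i];
	kernel_fpu_end();	/* restore FP/SIMD state */
}
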
> What is the goal of using ML models in the Linux kernel? The main goal
> is to employ ML models to elaborate the logic of a particular Linux
> kernel subsystem based on processed data and/or an efficient subsystem
> configuration based on the internal state of the subsystem. As a result,
> we need to: (1) collect data for training, (2) execute the ML model
> training phase, (3) test the trained ML model, and (4) use the ML model
> to execute the inference phase.
As far as I can tell, a lot of that does not need to be the kernel's
business. It is more of a GPU/TPU driver user-space interface thing; it
might be easier to let each driver keep its own kernel/user-space API and
expose a common user-space library API on top. Are you trying to define
something like Nvidia CUDA at the kernel level?
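
To make that concrete: even the data-collection step of your (1)-(4)
pipeline can live entirely in user space today. A rough sketch that just
reads one counter out of /proc/vmstat (the counter choice is arbitrary,
and whatever model or trainer consumes it would also sit in user space):

#include <stdio.h>
#include <string.h>

/* Read a single named counter from /proc/vmstat, or -1 if not found. */
static long read_vmstat(const char *name)
{
	char key[64];
	long val, found = -1;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld", key, &val) == 2) {
		if (!strcmp(key, name)) {
			found = val;
			break;
		}
	}
	fclose(f);
	return found;
}

int main(void)
{
	/* e.g. feed page-fault pressure into a user-space model/trainer */
	printf("pgmajfault = %ld\n", read_vmstat("pgmajfault"));
	return 0;
}
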
> The ML model inference can be used to recommend a Linux kernel
> subsystem configuration and/or to inject synthesized subsystem logic
> into kernel space (for example, eBPF logic).
That again sounds very much like a user-space issue, like the 1) use
case above.
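
Applying a configuration recommendation does not need new kernel
infrastructure either: a user-space agent can simply write the model's
suggested value into an existing tunable. A trivial sketch (vm.swappiness
is picked purely as an example, and the suggested value is made up):

#include <stdio.h>

int main(void)
{
	int suggested = 30;	/* pretend a user-space model suggested this */
	FILE *f = fopen("/proc/sys/vm/swappiness", "w");

	if (!f) {
		perror("/proc/sys/vm/swappiness");
		return 1;
	}
	fprintf(f, "%d\n", suggested);
	fclose(f);
	return 0;
}
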
> How can ML infrastructure be designed in the Linux kernel? We need to
> introduce into the Linux kernel a special ML library that implements a
> generalized interface of interaction between the ML model's thread in
> user space and a kernel subsystem. Such an interface requires the means
> to: (1) create/initialize/destroy an ML model proxy in a kernel
> subsystem, (2) start/stop the ML model proxy, (3) get/preprocess/publish
> data sets from kernel space, (4) receive/preprocess/apply ML model
> recommendation(s) from user space, (5) execute synthesized
> logic/recommendations in kernel space, (6) estimate the efficiency of
> synthesized logic/recommendations, (7) execute error back-propagation
> with the goal of correcting the ML model on the user-space side.
Unfortunately a lot of those will be tied to the internal implementation
of the GPU/TPU. The model needs to be compiled into GPU/TPU machine
instructions, so forcing a common interface will be hard because the
lower-level interface requirements might be very different. Maybe having
some common user-space library or ML description language is better than
forcing a kernel interface.
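
For reference, your (1)-(7) list would roughly amount to an ops table
like the sketch below (all names are hypothetical; nothing like this
exists in the kernel today). My concern is that every GPU/TPU backend
would want to fill these hooks in very differently:

#include <linux/types.h>

struct ml_model_proxy;	/* hypothetical per-subsystem proxy object */

/* Hypothetical sketch of the interface the proposal seems to describe. */
struct ml_model_proxy_ops {
	int  (*init)(struct ml_model_proxy *p);		/* (1) create/init */
	void (*destroy)(struct ml_model_proxy *p);	/* (1) destroy */
	int  (*start)(struct ml_model_proxy *p);	/* (2) start */
	void (*stop)(struct ml_model_proxy *p);		/* (2) stop */
	int  (*publish_dataset)(struct ml_model_proxy *p,
				const void *data, size_t len);	/* (3) */
	int  (*apply_recommendation)(struct ml_model_proxy *p,
				     const void *rec, size_t len); /* (4)/(5) */
	int  (*estimate_efficiency)(struct ml_model_proxy *p,
				    u64 *score);	/* (6) */
	int  (*report_error)(struct ml_model_proxy *p,
			     s64 error);		/* (7) feedback */
};
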
> The create and initialize logic can be executed by a kernel subsystem
> during module load or Linux kernel start (conversely, module unload or
> kernel shutdown will destroy the ML model proxy). The ML model thread
> in user space will be able to re-initialize and to execute the
> start/stop logic of the ML model proxy on the kernel side. First of all,
> the ML model needs to be trained on data from kernel space. The data can
> be requested by the ML model from user space, or the data can be
> published by the ML model proxy from kernel space. The sysfs interface
> can be used to orchestrate this interaction. As a result, the ML model
> in user space should be able to extract data set(s) from kernel space
> through sysfs, FUSE or a character device. The extracted data can be
> stored in persistent storage and, finally, the ML model can be trained
> in user space by accessing this data.
Currently a lot of this is happening in the GPU/TPU drivers and the
user-space libraries. One challenging aspect is that the hardware
interface is very different between GPUs/TPUs, so it might be hard to
expose common interfaces.
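
And for the "publish data sets from kernel space" part specifically,
plain sysfs already covers the simple cases. A minimal sketch of a module
exporting one made-up counter under /sys/kernel/ml_demo/ (all names are
hypothetical):

// SPDX-License-Identifier: GPL-2.0
#include <linux/module.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

static u64 ml_demo_sample;	/* hypothetical data point a trainer reads */
static struct kobject *ml_demo_kobj;

static ssize_t sample_show(struct kobject *kobj, struct kobj_attribute *attr,
			   char *buf)
{
	return sysfs_emit(buf, "%llu\n",
			  (unsigned long long)ml_demo_sample);
}
static struct kobj_attribute sample_attr = __ATTR_RO(sample);

static int __init ml_demo_init(void)
{
	ml_demo_kobj = kobject_create_and_add("ml_demo", kernel_kobj);
	if (!ml_demo_kobj)
		return -ENOMEM;
	return sysfs_create_file(ml_demo_kobj, &sample_attr.attr);
}

static void __exit ml_demo_exit(void)
{
	sysfs_remove_file(ml_demo_kobj, &sample_attr.attr);
	kobject_put(ml_demo_kobj);
}

module_init(ml_demo_init);
module_exit(ml_demo_exit);
MODULE_LICENSE("GPL");
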
> A continuous learning model can be adopted during the training phase.
> It implies that a kernel subsystem can receive ML model recommendations
> even during the training phase. The ML model proxy on the kernel side
> can estimate the current kernel subsystem state, try to apply the ML
> model recommendations, and estimate the efficiency of the applied
> recommendations. Generally speaking, the ML model proxy on the kernel
> side can consider several modes of interaction with ML model
> recommendations: (1) emergency mode,
That sounds like user-space interaction again. I am not sure it belongs
in kernel space.
Chris