Message-ID: <a994bdedca7d966168076044249a58e52754c6ac.camel@ibm.com>
Date: Mon, 9 Feb 2026 22:28:59 +0000
From: Viacheslav Dubeyko <Slava.Dubeyko@....com>
To: "chrisl@...nel.org" <chrisl@...nel.org>
CC: "clm@...a.com" <clm@...a.com>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Pavan Rallabhandi <Pavan.Rallabhandi@....com>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "lsf-pc@...ts.linux-foundation.org" <lsf-pc@...ts.linux-foundation.org>,
        "bpf@...r.kernel.org" <bpf@...r.kernel.org>
Subject: RE: [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel

On Mon, 2026-02-09 at 02:03 -0800, Chris Li wrote:
> On Fri, Feb 6, 2026 at 11:38 AM Viacheslav Dubeyko
> <Slava.Dubeyko@....com> wrote:
> > 
> > Hello,
> > 
> > Machine Learning (ML) is an approach to learning from data, finding
> > patterns, and making predictions without developers having to
> > implement the algorithms explicitly. The number of application areas
> > for ML grows every day. Generally speaking, ML can introduce
> > self-evolving and self-learning capabilities into the Linux kernel.
> > There are already research works and industry efforts that employ ML
> > approaches for configuring and optimizing the Linux kernel. However,
> > introducing ML approaches into the Linux kernel is not simple or
> > straightforward; there are multiple problems and unanswered questions
> > on this road. First of all, any ML model requires floating-point
> > operations (FPU) to run, but there is no direct use of the FPU in
> > kernel space. Also, an ML model requires a training phase that can
> > cause significant performance degradation of the Linux kernel. Even
> > the inference phase could be problematic from the performance point
> > of view on the kernel side. Using ML approaches in the Linux kernel
> > is an inevitable step. But how can we use ML approaches in the Linux
> > kernel? Which infrastructure do we need to adopt ML models in the
> > Linux kernel?
> 
> I think there are two different things; I suspect you want the latter,
> but I am not sure:
> 
> 1) Using an ML model to help kernel development: code reviews,
> generating patches from descriptions, etc. For example, Chris Mason
> has a kernel review repo on GitHub and he is sharing his review
> findings on the mailing list:
> https://github.com/masoncl/review-prompts/tree/main
> It is kernel-development related, but the ML agent code runs in user
> space. The actual ML computation might run on GPUs/TPUs. That does
> not seem to be what you have in mind.
> 
> 2) Running the ML model computation in kernel space.
> Can you clarify whether this is what you have in mind? You mention FPU
> usage in the kernel for the ML model. That is only relevant if you
> need to run the floating-point math as CPU instructions in the kernel.
> Most ML computations do not run as CPU instructions; they run on
> GPUs/TPUs. Why not keep the ML program (PyTorch/agents) in user space
> and pass the data to the GPU/TPU driver to run? There will be some
> kernel infrastructure like VFIO/IOMMU involved with the GPU/TPU
> driver. For the most part the kernel is just facilitating the data
> passing to/from the GPU/TPU driver and then to the GPU/TPU hardware.
> The ML hardware is doing the heavy lifting.
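
On the FPU point: kernel code can use hardware floating point only
inside explicitly bracketed regions. A minimal x86 sketch (note that
the translation unit also needs FP-enabled compiler flags, the way
e.g. amdgpu's display code arranges them; ordinary kernel C is built
without SSE):

#include <asm/fpu/api.h>        /* x86-specific FPU bracketing */

/* Scale a value by 0.5 using the FPU; keep integers at the boundary. */
static u32 scale_half(u32 v)
{
        u32 r;

        kernel_fpu_begin();     /* FP/SIMD allowed; must not sleep here */
        r = (u32)((float)v * 0.5f);
        kernel_fpu_end();

        return r;
}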

The idea is to have the ML model running in user space, with a kernel
subsystem able to interact with it. As the next step, I am considering
two real-life use cases: (1) the GC subsystem of an LFS file system,
(2) an ML-based DAMON approach. So, for example, GC can be represented
by an ML model in user space. The GC can request data (segment state)
from kernel space, and the ML model in user space can do training
and/or inference. As a result, the ML model in user space can select
victim segments and instruct the kernel-space logic to move valid data
from the victim segment(s) into clean/current one(s).
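
To make that concrete, here is a rough sketch of what the kernel/user-space
contract could look like for the GC case (all names are hypothetical and
only illustrate the shape of the exchange):

/* Hypothetical ABI sketch for the LFS GC use case; nothing here exists today. */
#include <linux/types.h>

struct gc_segment_state {        /* kernel -> user space */
        __u32 segno;             /* segment number */
        __u32 valid_blocks;      /* live blocks left in the segment */
        __u64 mtime;             /* age: last modification time */
};

struct gc_victim_decision {      /* user space -> kernel */
        __u32 segno;             /* victim segment chosen by the model */
        __u32 dst_segno;         /* clean/current segment to migrate into */
};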

> 
> > What is the goal of using ML models in the Linux kernel? The main
> > goal is to employ ML models to elaborate the logic of a particular
> > kernel subsystem based on processed data, and/or to derive an
> > efficient subsystem configuration based on the subsystem's internal
> > state. As a result, it needs to: (1) collect data for training,
> > (2) execute the ML model training phase, (3) test the trained ML
> > model, (4) use the ML model to execute the inference phase.
> 
> As far as I can tell, a lot of that is not the kernel's business. It
> is more of a GPU/TPU driver user-space interface thing; it might be
> easier to let each driver define its own kernel/user-space API and
> then expose a common user-space library API. Are you trying to define
> something like Nvidia CUDA at the kernel level?
> 
> > The ML model inference can be used to recommend a kernel subsystem
> > configuration and/or to inject synthesized subsystem logic into
> > kernel space (for example, eBPF logic).
> 
> That again sounds very much like a userspace issue, i.e. the above
> 1) use case.
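
For the eBPF direction, the injection step itself can stay on existing
interfaces; e.g., from user space with libbpf, assuming the model has
emitted a compiled object (the file name policy.bpf.o is hypothetical):

/* User space, libbpf; "policy.bpf.o" is a hypothetical model-generated object. */
#include <bpf/libbpf.h>

struct bpf_object *obj = bpf_object__open_file("policy.bpf.o", NULL);
if (!obj)
        return -1;              /* open failed */
if (bpf_object__load(obj))      /* verify and load into the kernel */
        return -1;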
> 
> > How can the ML infrastructure be designed in the Linux kernel? We
> > need to introduce into the Linux kernel a special ML library that
> > implements a generalized interface for interaction between an ML
> > model's thread in user space and a kernel subsystem. Such an
> > interface requires the means to:
> > (1) create/initialize/destroy an ML model proxy in a kernel subsystem,
> > (2) start/stop the ML model proxy, (3) get/preprocess/publish data sets
> > from kernel space, (4) receive/preprocess/apply ML model recommendation(s)
> > from user space, (5) execute synthesized logic/recommendations in kernel space,
> > (6) estimate the efficiency of the synthesized logic/recommendations,
> > (7) execute error back-propagation to correct the ML model on the
> > user-space side.
> 
> Unfortunately a lot of those will be tied to the internal
> implementation of the GPU/TPU. The model needs to be compiled into
> GPU/TPU machine instructions, so forcing a common interface will be
> hard because the lower-level interface requirements may be very
> different. Maybe having some common user-space library or ML
> description language is better than forcing a kernel interface.
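
To illustrate the shape of items (1)-(7) above, a hypothetical
kernel-side ops table could look like this (a sketch only; none of it
exists today, and all names are made up):

/* Hypothetical sketch of the ML model proxy interface; not an existing API. */
struct ml_proxy;

struct ml_proxy_ops {
        int     (*init)(struct ml_proxy *p);                    /* (1) */
        void    (*destroy)(struct ml_proxy *p);                 /* (1) */
        int     (*start)(struct ml_proxy *p);                   /* (2) */
        void    (*stop)(struct ml_proxy *p);                    /* (2) */
        ssize_t (*publish_data)(struct ml_proxy *p,
                                void *buf, size_t len);         /* (3) */
        int     (*apply_rec)(struct ml_proxy *p,
                             const void *rec, size_t len);      /* (4), (5) */
        long    (*estimate)(struct ml_proxy *p);                /* (6) */
        int     (*feedback)(struct ml_proxy *p, long error);    /* (7) */
};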
> 
> > The create and initialize logic can be executed by the kernel
> > subsystem during module load or Linux kernel start (conversely,
> > module unload or kernel shutdown will destroy the ML model proxy).
> > The ML model thread in user space will be able to re-initialize the
> > proxy and to execute the start/stop logic of the ML model proxy on
> > the kernel side. First of all, the ML model needs to be trained on
> > data from kernel space. The data can be requested by the ML model
> > from user space, or the data can be published by the ML model proxy
> > from kernel space. The sysfs interface can be used to orchestrate
> > this interaction. As a result, the ML model in user space should be
> > able to extract data set(s) from kernel space through sysfs, FUSE,
> > or a character device. Extracted data can be stored in persistent
> > storage and, finally, the ML model can be trained in user space by
> > accessing this data.
> 
> Currently a lot of this happens in the GPU/TPU drivers and user-space
> libraries. One challenging aspect is that the hardware interface is
> very different between GPUs/TPUs, so it might be challenging to
> expose common interfaces.
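
As a minimal illustration of the kernel-side publishing path through
sysfs (a sketch; the attribute name and counters are made up):

/* Hypothetical sysfs attribute publishing GC state; counters are made up. */
static ssize_t gc_state_show(struct kobject *kobj,
                             struct kobj_attribute *attr, char *buf)
{
        return sysfs_emit(buf, "%u %u\n", nr_valid_blocks, nr_free_segments);
}
static struct kobj_attribute gc_state_attr = __ATTR_RO(gc_state);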
> 
> > A continuous learning model can be adopted during the training
> > phase. This implies that the kernel subsystem can receive ML model
> > recommendations even during the training phase. The ML model proxy
> > on the kernel side can estimate the current kernel subsystem state,
> > try to apply the ML model recommendations, and estimate the
> > efficiency of the applied recommendations. Generally speaking, the
> > ML model proxy on the kernel side can consider several modes of
> > interaction with ML model recommendations: (1) emergency mode,
> 
> That sounds like user-space interaction again. I am not sure it
> belongs in kernel space.

Thanks a lot for sharing all your thoughts. :) I think I need to point
out the main idea again: the ML model runs in user space, and a kernel
subsystem can interact with that ML model in user space. :) The goal of
the ML library is to implement a generalized interface/functionality
that gives any kernel subsystem the capability to be extended by an ML
model in user space. And I believe that we can provide this in a
generic way.
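
On the user-space side, the model's driver loop against such an
interface could stay very simple; a sketch assuming a hypothetical
/dev/ml_gc character device and the structures sketched earlier
(pick_victim() and pick_destination() stand in for the model):

/* User space; /dev/ml_gc and both helpers are hypothetical. */
#include <fcntl.h>
#include <unistd.h>

int fd = open("/dev/ml_gc", O_RDWR);
struct gc_segment_state state;
struct gc_victim_decision decision;

while (read(fd, &state, sizeof(state)) == sizeof(state)) {
        /* feed `state` to the model: train and/or infer */
        decision.segno = pick_victim(&state);
        decision.dst_segno = pick_destination();
        write(fd, &decision, sizeof(decision));   /* push the recommendation */
}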

You can also check the patchset [1] to see a vision of a potential
implementation of the idea.

Thanks,
Slava.

[1]
https://lore.kernel.org/linux-fsdevel/20260206191136.2609767-1-slava@dubeyko.com/T/#t
