linux-kernel - Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP order selection

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALOAHbATDURsi265PGQ7022vC9QsKUxxyiDUL9wLKGgVpaxJUw@mail.gmail.com>
Date: Tue, 7 Oct 2025 16:47:07 +0800
From: Yafang Shao <laoar.shao@...il.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, David Hildenbrand <david@...hat.com>, ziy@...dia.com, 
	baolin.wang@...ux.alibaba.com, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	Liam Howlett <Liam.Howlett@...cle.com>, npache@...hat.com, ryan.roberts@....com, 
	dev.jain@....com, Johannes Weiner <hannes@...xchg.org>, usamaarif642@...il.com, 
	gutierrez.asier@...wei-partners.com, Matthew Wilcox <willy@...radead.org>, 
	Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	Andrii Nakryiko <andrii@...nel.org>, Amery Hung <ameryhung@...il.com>, 
	David Rientjes <rientjes@...gle.com>, Jonathan Corbet <corbet@....net>, 21cnbao@...il.com, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Tejun Heo <tj@...nel.org>, lance.yang@...ux.dev, 
	Randy Dunlap <rdunlap@...radead.org>, bpf <bpf@...r.kernel.org>, 
	linux-mm <linux-mm@...ck.org>, "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>, 
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP
 order selection

On Fri, Oct 3, 2025 at 10:18 AM Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@...il.com> wrote:
> >
> > +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> > +                                     enum tva_type type,
> > +                                     unsigned long orders)
> > +{
> > +       thp_order_fn_t *bpf_hook_thp_get_order;
> > +       int bpf_order;
> > +
> > +       /* No BPF program is attached */
> > +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +                     &transparent_hugepage_flags))
> > +               return orders;
> > +
> > +       rcu_read_lock();
> > +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> > +       if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> > +               goto out;
> > +
> > +       bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> > +       orders &= BIT(bpf_order);
> > +
> > +out:
> > +       rcu_read_unlock();
> > +       return orders;
> > +}
>

Hello Alexei,

My apologies for the slow reply. I'm on a family vacation and am
checking email intermittently.

> I thought I explained it earlier.

I recall your earlier suggestion for a cgroup-based approach for
BPF-THP. However, as I mentioned, I believe cgroups might not be the
best fit[0]. My understanding was that we had agreed to move away from
that model. Could we realign on this?

[0].  https://lwn.net/ml/all/CALOAHbBvwT+6f_4gBHzPc9n_SukhAs_sa5yX=AjHYsWic1MRuw@mail.gmail.com/

> Nack to a single global prog approach.

The design of BPF-THP as a global program is a direct consequence of
its purpose: to extend the existing global
/sys/kernel/mm/transparent_hugepage/ interface. This architectural
consistency simplifies both understanding and maintenance.

Crucially, this global nature does not limit policy control. The
program is designed with the flexibility to enforce policies at
multiple levels—globally, per-cgroup, or per-task—enabling all of our
target use cases through a unified mechanism.

>
> The logic must accommodate multiple programs per-container
> or any other way from the beginning.
> If cgroup based scoping doesn't fit use per process tree scoping.

During the initial design of BPF-THP, I evaluated whether a global
program or a per-process program would be more suitable. While a
per-process design would require embedding a struct_ops into
task_struct, this seemed like over-engineering to me. We can
efficiently implement both cgroup-tree-scoped and process-tree-scoped
THP policies using existing BPF helpers, such as:

  SCOPING                        BPF kfuncs
  cgroup tree   ->  bpf_task_under_cgroup()
  process tree -> bpf_task_is_ ancestors()

With these kfuncs, there is no need to attach individual BPF-THP
programs to every process or cgroup tree. I have not identified a
valid use case that necessitates embedding a struct_ops in task_struct
which can't be achieved more simply with these kfuncs. If such use
cases exist, please detail them. Consequently, I proceeded with a
global struct_ops implementation.

The desire to attach multiple BPF-THP programs simultaneously does not
appear to be a valid use case. Furthermore, our production experience
has shown that multiple attachments often introduce conflicts. This is
precisely why system administrators prefer to manage BPF programs with
a single manager—to avoid undefined behaviors from competing programs.

Focusing specifically on BPF-THP, the semantics of the program make
multiple attachments unsuitable. A BPF-THP program's outcome is its
return value (a suggested THP order), not the side effects of its
execution. In other words, it is functionally a variant of fmod_ret.

If we allow multiple attachments and they return different values, how
do we resolve the conflict?

If one program returns order-9 and another returns order-1, which
value should be chosen? Neither 1, 9, nor a combination (1 & 9) is
appropriate. The only logical solution is to reject subsequent
attachments and explicitly notify the user of the conflict. Our goal
should be to prevent conflicts from the outset, rather than forcing
developers to create another userspace manager to handle them.

A single global program is a natural and logical extension of the
existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
a good fit for BPF-THP and avoids unnecessary complexity.

Please provide a detailed clarification if I have misunderstood your position.

-- 
Regards
Yafang