Message-ID: <1940d681-94a6-48fb-b889-cd8f0b91b330@huawei-partners.com>
Date: Wed, 8 Oct 2025 15:49:40 +0300
From: Gutierrez Asier <gutierrez.asier@...wei-partners.com>
To: Yafang Shao <laoar.shao@...il.com>, Zi Yan <ziy@...dia.com>
CC: David Hildenbrand <david@...hat.com>, Alexei Starovoitov
<alexei.starovoitov@...il.com>, Johannes Weiner <hannes@...xchg.org>, Andrew
Morton <akpm@...ux-foundation.org>, <baolin.wang@...ux.alibaba.com>, Lorenzo
Stoakes <lorenzo.stoakes@...cle.com>, Liam Howlett <Liam.Howlett@...cle.com>,
<npache@...hat.com>, <ryan.roberts@....com>, <dev.jain@....com>,
<usamaarif642@...il.com>, Matthew Wilcox <willy@...radead.org>, Alexei
Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, Andrii
Nakryiko <andrii@...nel.org>, Amery Hung <ameryhung@...il.com>, David
Rientjes <rientjes@...gle.com>, Jonathan Corbet <corbet@....net>,
<21cnbao@...il.com>, Shakeel Butt <shakeel.butt@...ux.dev>, Tejun Heo
<tj@...nel.org>, <lance.yang@...ux.dev>, Randy Dunlap
<rdunlap@...radead.org>, bpf <bpf@...r.kernel.org>, linux-mm
<linux-mm@...ck.org>, "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP
order selection
Hi,
On 10/8/2025 3:06 PM, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 7:27 PM Zi Yan <ziy@...dia.com> wrote:
>>
>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>>
>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@...hat.com> wrote:
>>>>
>>>> On 08.10.25 10:18, Yafang Shao wrote:
>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@...hat.com> wrote:
>>>>>>
>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@...il.com> wrote:
>>>>>>>>
>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>>>>>>>> + enum tva_type type,
>>>>>>>> + unsigned long orders)
>>>>>>>> +{
>>>>>>>> + thp_order_fn_t *bpf_hook_thp_get_order;
>>>>>>>> + int bpf_order;
>>>>>>>> +
>>>>>>>> + /* No BPF program is attached */
>>>>>>>> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>>>>>>> + &transparent_hugepage_flags))
>>>>>>>> + return orders;
>>>>>>>> +
>>>>>>>> + rcu_read_lock();
>>>>>>>> + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>>>>>>>> + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>>>>>>>> + goto out;
>>>>>>>> +
>>>>>>>> + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>>>>>>>> + orders &= BIT(bpf_order);
>>>>>>>> +
>>>>>>>> +out:
>>>>>>>> + rcu_read_unlock();
>>>>>>>> + return orders;
>>>>>>>> +}
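For illustration, a minimal struct_ops policy wiring into this hook could
look roughly like the sketch below. This is an assumption-laden sketch, not
part of the patch: the section names, the BPF_PROG signature, and the
bpf_thp_ops map layout are inferred from the code above.

/* Hypothetical sketch of a BPF THP policy using the proposed hook. */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("struct_ops/thp_get_order")
int BPF_PROG(thp_get_order, struct vm_area_struct *vma,
	     enum tva_type type, unsigned long orders)
{
	/* Pick the highest order the caller permits (highest set bit). */
	return orders ? 63 - __builtin_clzl(orders) : 0;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.thp_get_order = (void *)thp_get_order,
};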
>>>>>>>
>>>>>>> I thought I explained it earlier.
>>>>>>> Nack to a single global prog approach.
>>>>>>
>>>>>> I agree. We should have the option to either specify a policy globally,
>>>>>> or in a more refined way for cgroups/processes.
>>>>>>
>>>>>> It's an interesting question whether a program would ever want to ship
>>>>>> its own policy: I can see use cases for that.
>>>>>>
>>>>>> So I agree that we should make it more flexible right from the start.
>>>>>
>>>>> To achieve per-process granularity, the struct-ops must be embedded
>>>>> within the mm_struct as follows:
>>>>>
>>>>> +#ifdef CONFIG_BPF_MM
>>>>> +struct bpf_mm_ops {
>>>>> +#ifdef CONFIG_BPF_THP
>>>>> + struct bpf_thp_ops bpf_thp;
>>>>> +#endif
>>>>> +};
>>>>> +#endif
>>>>> +
>>>>> /*
>>>>> * Opaque type representing current mm_struct flag state. Must be accessed via
>>>>> * mm_flags_xxx() helper functions.
>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
>>>>> #ifdef CONFIG_MM_ID
>>>>> mm_id_t mm_id;
>>>>> #endif /* CONFIG_MM_ID */
>>>>> +
>>>>> +#ifdef CONFIG_BPF_MM
>>>>> + struct bpf_mm_ops bpf_mm;
>>>>> +#endif
>>>>> } __randomize_layout;
>>>>>
>>>>> We should be aware that this will involve extensive changes in mm/.
>>>>
>>>> That's what we do on linux-mm :)
>>>>
>>>> It would be great to use Alexei's feedback/experience to come up with
>>>> something that is flexible for various use cases.
>>>
>>> I'm still not entirely convinced that allowing individual processes or
>>> cgroups to run independent progs is a valid use case. However, since
>>> we have a consensus that this is the right direction, I will proceed
>>> with this approach.
>>>
>>>>
>>>> So I think this is likely the right direction.
>>>>
>>>> It would be great to evaluate which scenarios we could unlock with this
>>>> (global vs. per-process vs. per-cgroup) approach, and how
>>>> extensive/involved the changes will be.
>>>
>>> 1. Global Approach
>>> - Pros:
>>> Simple;
>>> Can manage different THP policies for different cgroups or processes.
>>> - Cons:
>>> Does not allow individual processes to run their own BPF programs.
>>>
>>> 2. Per-Process Approach
>>> - Pros:
>>> Enables each process to run its own BPF program.
>>> - Cons:
>>> Introduces significant complexity, as it requires handling the
>>> BPF program's lifecycle (creation, destruction, inheritance) within
>>> every mm_struct (see the sketch after this list).
>>>
>>> 3. Per-Cgroup Approach
>>> - Pros:
>>> Allows individual cgroups to run their own BPF programs.
>>> Less complex than the per-process model, as it can leverage the
>>> existing cgroup operations structure.
>>> - Cons:
>>> Creates a dependency on the cgroup subsystem;
>>> might not be easy to control at the per-process level.
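As a rough sketch of the lifecycle handling the per-process approach would
require, fork-time inheritance and teardown might look like the following.
The helper names (bpf_mm_inherit, bpf_mm_release) are hypothetical; only the
bpf_mm field comes from the snippet earlier in this thread.

/* Hypothetical sketch only: the helpers are illustrative, not from the
 * patch. Shows where per-mm policy lifecycle hooks would have to live.
 */
static int dup_mm_bpf(struct mm_struct *mm, struct mm_struct *oldmm)
{
#ifdef CONFIG_BPF_MM
	/* Take a reference on the parent's attached struct_ops, if any. */
	return bpf_mm_inherit(&mm->bpf_mm, &oldmm->bpf_mm);
#else
	return 0;
#endif
}

static void exit_mm_bpf(struct mm_struct *mm)
{
#ifdef CONFIG_BPF_MM
	/* Drop the reference when the mm_struct is destroyed. */
	bpf_mm_release(&mm->bpf_mm);
#endif
}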
>>
>> Another issue is how, and by whom, hierarchical cgroups should be handled,
>> where one cgroup is a parent of another. Should the BPF program do that, or
>> the mm code?
>
> The cgroup subsystem handles this propagation automatically. When a
> BPF program is attached to a cgroup via cgroup_bpf_attach(), it's
> automatically inherited by all descendant cgroups.
>
> Note: struct_ops programs aren't supported by cgroup_bpf_attach(), so we
> would need to build a new attachment mechanism for cgroup-based
> struct_ops.
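For reference, attaching a struct_ops map today goes through a BPF link
rather than cgroup_bpf_attach(); a minimal userspace sketch with libbpf
might look as follows. The skeleton name bpf_thp and the map name
thp_policy are assumptions matching the sketch above, not confirmed by this
thread.

#include <bpf/libbpf.h>
#include "bpf_thp.skel.h" /* hypothetical skeleton for the policy program */

int main(void)
{
	struct bpf_thp *skel = bpf_thp__open_and_load();
	struct bpf_link *link;

	if (!skel)
		return 1;
	/* This attaches the policy globally; a per-cgroup variant would
	 * need the new mechanism discussed in this thread.
	 */
	link = bpf_map__attach_struct_ops(skel->maps.thp_policy);
	if (!link) {
		bpf_thp__destroy(skel);
		return 1;
	}
	/* ... keep the link alive while the policy should be in effect ... */
	bpf_link__destroy(link);
	bpf_thp__destroy(skel);
	return 0;
}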
>
>> I remember hierarchical cgroups were the main reason THP control at the
>> cgroup level was rejected. If we do per-cgroup BPF control, wouldn't we
>> get the same rejection from the cgroup folks?
>
> Right, it was rejected by the cgroup maintainers [0].
>
> [0]. https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/
>
Yes, the patch was rejected because:
1. It breaks the cgroup hierarchy when two siblings have different THP policies.
2. Cgroups were designed for resource management, not for grouping processes
and tuning them.
3. It would set a precedent for adding new flags to cgroups, potentially
polluting them. We could end up with cgroups carrying tens of different
flags, making the sysadmin's job more complex.
In the MM call I proposed a new mechanism based on limits, similar to
hugetlbfs.
The main issue, still, is that sysadmins would need to set those limits up,
making their job more complex.
I remember a few participants mentioned the idea of the kernel managing huge
page consumption automatically using some sort of heuristics. To be honest, I
haven't had the time to sit down and think about it. I would be glad to
cooperate and come up with a feasible solution.
--
Asier Gutierrez
Huawei