Message-ID: <1940d681-94a6-48fb-b889-cd8f0b91b330@huawei-partners.com>
Date: Wed, 8 Oct 2025 15:49:40 +0300
From: Gutierrez Asier <gutierrez.asier@...wei-partners.com>
To: Yafang Shao <laoar.shao@...il.com>, Zi Yan <ziy@...dia.com>
CC: David Hildenbrand <david@...hat.com>, Alexei Starovoitov
<alexei.starovoitov@...il.com>, Johannes Weiner <hannes@...xchg.org>, Andrew
Morton <akpm@...ux-foundation.org>, <baolin.wang@...ux.alibaba.com>, Lorenzo
Stoakes <lorenzo.stoakes@...cle.com>, Liam Howlett <Liam.Howlett@...cle.com>,
<npache@...hat.com>, <ryan.roberts@....com>, <dev.jain@....com>,
<usamaarif642@...il.com>, Matthew Wilcox <willy@...radead.org>, Alexei
Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, Andrii
Nakryiko <andrii@...nel.org>, Amery Hung <ameryhung@...il.com>, David
Rientjes <rientjes@...gle.com>, Jonathan Corbet <corbet@....net>,
<21cnbao@...il.com>, Shakeel Butt <shakeel.butt@...ux.dev>, Tejun Heo
<tj@...nel.org>, <lance.yang@...ux.dev>, Randy Dunlap
<rdunlap@...radead.org>, bpf <bpf@...r.kernel.org>, linux-mm
<linux-mm@...ck.org>, "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v9 mm-new 03/11] mm: thp: add support for BPF based THP
order selection
Hi,
On 10/8/2025 3:06 PM, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 7:27 PM Zi Yan <ziy@...dia.com> wrote:
>>
>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>>
>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@...hat.com> wrote:
>>>>
>>>> On 08.10.25 10:18, Yafang Shao wrote:
>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@...hat.com> wrote:
>>>>>>
>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@...il.com> wrote:
>>>>>>>>
>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>>>>>>>> + enum tva_type type,
>>>>>>>> + unsigned long orders)
>>>>>>>> +{
>>>>>>>> + thp_order_fn_t *bpf_hook_thp_get_order;
>>>>>>>> + int bpf_order;
>>>>>>>> +
>>>>>>>> + /* No BPF program is attached */
>>>>>>>> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>>>>>>> + &transparent_hugepage_flags))
>>>>>>>> + return orders;
>>>>>>>> +
>>>>>>>> + rcu_read_lock();
>>>>>>>> + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>>>>>>>> + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>>>>>>>> + goto out;
>>>>>>>> +
>>>>>>>> + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>>>>>>>> + orders &= BIT(bpf_order);
>>>>>>>> +
>>>>>>>> +out:
>>>>>>>> + rcu_read_unlock();
>>>>>>>> + return orders;
>>>>>>>> +}
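For illustration, a minimal struct_ops policy wiring into this hook could
look roughly like the sketch below. This is an assumption-laden sketch, not
part of the patch: the section names, the BPF_PROG signature, and the
bpf_thp_ops map layout are inferred from the code above.

/* Hypothetical sketch of a BPF THP policy using the proposed hook. */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("struct_ops/thp_get_order")
int BPF_PROG(thp_get_order, struct vm_area_struct *vma,
	     enum tva_type type, unsigned long orders)
{
	/* Pick the highest order the caller permits (highest set bit). */
	return orders ? 63 - __builtin_clzl(orders) : 0;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.thp_get_order = (void *)thp_get_order,
};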
>>>>>>>
>>>>>>> I thought I explained it earlier.
>>>>>>> Nack to a single global prog approach.
>>>>>>
>>>>>> I agree. We should have the option to either specify a policy globally,
>>>>>> or in a more refined way for cgroups/processes.
>>>>>>
>>>>>> It's an interesting question whether a program would ever want to ship
>>>>>> its own policy: I can see use cases for that.
>>>>>>
>>>>>> So I agree that we should make it more flexible right from the start.
>>>>>
>>>>> To achieve per-process granularity, the struct-ops must be embedded
>>>>> within the mm_struct as follows:
>>>>>
>>>>> +#ifdef CONFIG_BPF_MM
>>>>> +struct bpf_mm_ops {
>>>>> +#ifdef CONFIG_BPF_THP
>>>>> + struct bpf_thp_ops bpf_thp;
>>>>> +#endif
>>>>> +};
>>>>> +#endif
>>>>> +
>>>>> /*
>>>>> * Opaque type representing current mm_struct flag state. Must be accessed via
>>>>> * mm_flags_xxx() helper functions.
>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
>>>>> #ifdef CONFIG_MM_ID
>>>>> mm_id_t mm_id;
>>>>> #endif /* CONFIG_MM_ID */
>>>>> +
>>>>> +#ifdef CONFIG_BPF_MM
>>>>> + struct bpf_mm_ops bpf_mm;
>>>>> +#endif
>>>>> } __randomize_layout;
>>>>>
>>>>> We should be aware that this will involve extensive changes in mm/.
>>>>
>>>> That's what we do on linux-mm :)
>>>>
>>>> It would be great to use Alexei's feedback/experience to come up with
>>>> something that is flexible for various use cases.
>>>
>>> I'm still not entirely convinced that allowing individual processes or
>>> cgroups to run independent progs is a valid use case. However, since
>>> we have a consensus that this is the right direction, I will proceed
>>> with this approach.
>>>
>>>>
>>>> So I think this is likely the right direction.
>>>>
>>>> It would be great to evaluate which scenarios we could unlock with this
>>>> (global vs. per-process vs. per-cgroup) approach, and how
>>>> extensive/involved the changes will be.
>>>
>>> 1. Global Approach
>>> - Pros:
>>> Simple;
>>> Can manage different THP policies for different cgroups or processes.
>>> - Cons:
>>> Does not allow individual processes to run their own BPF programs.
>>>
>>> 2. Per-Process Approach
>>> - Pros:
>>> Enables each process to run its own BPF program.
>>> - Cons:
>>> Introduces significant complexity, as it requires handling the
>>> BPF program's lifecycle (creation, destruction, inheritance) within
>>> every mm_struct (see the sketch after this list).
>>>
>>> 3. Per-Cgroup Approach
>>> - Pros:
>>> Allows individual cgroups to run their own BPF programs.
>>> Less complex than the per-process model, as it can leverage the
>>> existing cgroup operations structure.
>>> - Cons:
>>> Creates a dependency on the cgroup subsystem;
>>> might not be easy to control at the per-process level.
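As a rough sketch of the lifecycle handling the per-process approach would
require, fork-time inheritance and teardown might look like the following.
The helper names (bpf_mm_inherit, bpf_mm_release) are hypothetical; only the
bpf_mm field comes from the snippet earlier in this thread.

/* Hypothetical sketch only: the helpers are illustrative, not from the
 * patch. Shows where per-mm policy lifecycle hooks would have to live.
 */
static int dup_mm_bpf(struct mm_struct *mm, struct mm_struct *oldmm)
{
#ifdef CONFIG_BPF_MM
	/* Take a reference on the parent's attached struct_ops, if any. */
	return bpf_mm_inherit(&mm->bpf_mm, &oldmm->bpf_mm);
#else
	return 0;
#endif
}

static void exit_mm_bpf(struct mm_struct *mm)
{
#ifdef CONFIG_BPF_MM
	/* Drop the reference when the mm_struct is destroyed. */
	bpf_mm_release(&mm->bpf_mm);
#endif
}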
>>
>> Another issue is how, and by whom, hierarchical cgroups should be handled,
>> where one cgroup is a parent of another. Should the BPF program do that, or
>> the mm code?
>
> The cgroup subsystem handles this propagation automatically. When a
> BPF program is attached to a cgroup via cgroup_bpf_attach(), it's
> automatically inherited by all descendant cgroups.
>
> Note: struct_ops programs aren't supported by cgroup_bpf_attach(), so we
> would need to build a new attachment mechanism for cgroup-based
> struct_ops.
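For reference, attaching a struct_ops map today goes through a BPF link
rather than cgroup_bpf_attach(); a minimal userspace sketch with libbpf
might look as follows. The skeleton name bpf_thp and the map name
thp_policy are assumptions matching the sketch above, not confirmed by this
thread.

#include <bpf/libbpf.h>
#include "bpf_thp.skel.h" /* hypothetical skeleton for the policy program */

int main(void)
{
	struct bpf_thp *skel = bpf_thp__open_and_load();
	struct bpf_link *link;

	if (!skel)
		return 1;
	/* This attaches the policy globally; a per-cgroup variant would
	 * need the new mechanism discussed in this thread.
	 */
	link = bpf_map__attach_struct_ops(skel->maps.thp_policy);
	if (!link) {
		bpf_thp__destroy(skel);
		return 1;
	}
	/* ... keep the link alive while the policy should be in effect ... */
	bpf_link__destroy(link);
	bpf_thp__destroy(skel);
	return 0;
}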
>
>> I remember hierarchical cgroups were the main reason THP control at the
>> cgroup level was rejected. If we do per-cgroup BPF control, wouldn't we
>> get the same rejection from the cgroup folks?
>
> Right, it was rejected by the cgroup maintainers [0].
>
> [0]. https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/
>
Yes, the patch was rejected because:
1. It breaks the cgroup hierarchy when two siblings have different THP policies.
2. Cgroups were designed for resource management, not for grouping processes
and tuning them.
3. It would set a precedent for adding new flags to cgroups, potentially
polluting them. We could end up with cgroups carrying tens of different
flags, making the sysadmin's job more complex.
In the MM call I proposed a new mechanism based on limits, similar to
hugetlbfs.
The main issue, still, is that sysadmins would need to set those limits up,
making their job more complex.
I remember a few participants mentioned the idea of the kernel managing huge
page consumption automatically using some sort of heuristics. To be honest, I
haven't had the time to sit down and think about it. I would be glad to
cooperate and come up with a feasible solution.
--
Asier Gutierrez
Huawei