Message-ID: <87a512muze.fsf@linux.dev>
Date: Mon, 03 Nov 2025 17:45:09 -0800
From: Roman Gushchin <roman.gushchin@...ux.dev>
To: Michal Hocko <mhocko@...e.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
  linux-kernel@...r.kernel.org,  Alexei Starovoitov <ast@...nel.org>,
  Suren Baghdasaryan <surenb@...gle.com>,  Shakeel Butt
 <shakeel.butt@...ux.dev>,  Johannes Weiner <hannes@...xchg.org>,  Andrii
 Nakryiko <andrii@...nel.org>,  JP Kobryn <inwardvessel@...il.com>,
  linux-mm@...ck.org,  cgroups@...r.kernel.org,  bpf@...r.kernel.org,
  Martin KaFai Lau <martin.lau@...nel.org>,  Song Liu <song@...nel.org>,
  Kumar Kartikeya Dwivedi <memxor@...il.com>,  Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling

Michal Hocko <mhocko@...e.com> writes:

> On Sun 02-11-25 13:36:25, Roman Gushchin wrote:
>> Michal Hocko <mhocko@...e.com> writes:
>> 
>> > On Mon 27-10-25 16:17:09, Roman Gushchin wrote:
>> >> Introduce a bpf struct ops for implementing custom OOM handling
>> >> policies.
>> >> 
>> >> It's possible to load one bpf_oom_ops for the system and one
>> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
>> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
>> >> corresponding BPF OOM handlers are executed until some memory is
>> >> freed. If no memory is freed, the kernel OOM killer is invoked.
>> >
>> > Do you have any usecase in mind where parent memcg oom handler decides
>> > to not kill or cannot kill anything and hand over upwards in the
>> > hierarchy?
>> 
>> I believe that in most cases bpf handlers will handle ooms themselves,
>> but because, strictly speaking, I don't have control over what bpf
>> programs do or don't do, the kernel should provide a fallback mechanism.
>> This is a common practice with bpf, e.g. sched_ext falls back to
>> CFS/EEVDF in case something is wrong.
>
> We do have a fallback mechanism - the kernel oom handling. For that we
> do not need to pass to the parent handler. Please note that I am not
> opposing this, but I would like to understand the thinking behind it,
> and I would hope to start with a simpler model and extend it later,
> rather than go with a more complex one initially and then corner
> ourselves with weird side effects.
>  
>> Specifically to OOM case, I believe someone might want to use bpf
>> programs just for monitoring/collecting some information, without
>> trying to actually free some memory.
>> 
>> >> The struct ops provides the bpf_handle_out_of_memory() callback,
>> >> which is expected to return 1 if it was able to free some memory and 0
>> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
>> >> field of the oom_control structure, which is expected to be set by
>> >> kfuncs suitable for releasing memory. If both are set, OOM is
>> >> considered handled, otherwise the next OOM handler in the chain
>> >> (e.g. BPF OOM attached to the parent cgroup or the in-kernel OOM
>> >> killer) is executed.
>> >
>> > Could you explain why we need both? Why is the bpf_memory_freed
>> > field alone not sufficient?
>> 
>> Strictly speaking, bpf_memory_freed should be enough, but because
>> bpf programs have to return an int and there is no additional cost
>> to add this option (pass to the next or the in-kernel oom handler), I
>> thought it was not a bad idea. If you feel strongly otherwise, I can
>> ignore the return value and rely on bpf_memory_freed only.
>
> No, I do not feel strongly one way or the other, but I would like to
> understand the thinking behind that. My slight preference would be a
> single return status that clearly describes the intention. If you want
> more flexible chaining semantics, then an enum { IGNORED, HANDLED,
> PASS_TO_PARENT, ... } would be more flexible, extensible, and easier
> to understand.

The thinking is simple:
1) Most users will have a single global bpf oom policy, which basically
replaces the in-kernel oom killer.
2) If there are standalone containers, they might want to do the same at
their level. And the "host" system doesn't directly control that.
3) If for some reason the inner oom handler fails to free up some
memory, there are two potential fallback options: call the in-kernel oom
killer for that memory cgroup or call an upper-level bpf oom handler, if
there is one.

I think the latter is more logical and less surprising. Imagine you're
running multiple containers and some of them implement their own bpf oom
logic and some don't. Why would we treat them differently if their bpf
logic fails?
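
To make the intended order concrete, the dispatch would look roughly
like the sketch below. This is a simplified illustration, not the exact
patch code: the helpers bpf_oom_ops_of() and bpf_oom_run_handler() and
the global_bpf_oom_ops pointer are made-up names here.

/*
 * Rough sketch of the OOM dispatch order: per-memcg handlers from the
 * OOM'ing memcg up to the root, then the global handler, then the
 * in-kernel OOM killer. Names are illustrative.
 */
static bool bpf_handle_oom(struct oom_control *oc)
{
	struct mem_cgroup *memcg;

	/* For a system-wide OOM oc->memcg is NULL and the loop is
	 * skipped entirely. */
	for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
		struct bpf_oom_ops *ops = bpf_oom_ops_of(memcg);

		if (ops && bpf_oom_run_handler(ops, oc))
			return true;	/* handler freed some memory */
	}

	if (global_bpf_oom_ops &&
	    bpf_oom_run_handler(global_bpf_oom_ops, oc))
		return true;

	return false;	/* caller falls back to the in-kernel OOM killer */
}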

Re a single return value: I can absolutely specify return values as an
enum; my point is that, unlike kernel code, we can't fully trust the
value returned from a bpf program, which is why the second check is in
place.

Can we just ignore the returned value and rely on the bpf_memory_freed
flag? Sure, but I don't think it buys us anything.
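
For reference, the trust model is roughly the following sketch.
bpf_oom_run_handler() is the same illustrative helper as in the sketch
above, and the enum just mirrors Michal's suggestion (hypothetical, not
in the patch):

/* The int returned by the bpf program is advisory; the flag set by
 * memory-freeing kfuncs is the authoritative signal. */
static bool bpf_oom_run_handler(struct bpf_oom_ops *ops,
				struct oom_control *oc)
{
	oc->bpf_memory_freed = false;

	if (!ops->handle_out_of_memory(oc))
		return false;

	return oc->bpf_memory_freed;
}

/* Michal's alternative: a single, explicit status enum. */
enum bpf_oom_status {
	BPF_OOM_IGNORED,	/* handler did nothing */
	BPF_OOM_HANDLED,	/* handler freed memory, stop here */
	BPF_OOM_PASS_TO_PARENT,	/* try the next handler up the tree */
};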

Also, I have to admit that I don't have an immediate production use case
for nested oom handlers (I'm fine with a global one), but it was
requested by Alexei Starovoitov. And I agree with him that the
containerized case will come up soon, so it's better to think it through
in advance.

>> >> The bpf_handle_out_of_memory() callback program is sleepable to enable
>> >> using iterators, e.g. cgroup iterators. The callback receives struct
>> >> oom_control as an argument, so it can determine the scope of the OOM
>> >> event: whether it is a memcg-wide or a system-wide OOM.
>> >
>> > This could be tricky because it might introduce a subtle and
>> > hard-to-debug lock dependency chain: lock(a); allocation() -> oom ->
>> > lock(a). Sleepable locks should only be allowed in trylock mode.
>> 
>> Agreed, but it's achieved by controlling the context where oom can be
>> declared (e.g. in the bpf_psi case it's done from a work context).
>
> But out_of_memory() can be called from any sleepable context. So this
> is a real problem.

We need to restrict both:
1) where bpf_out_of_memory() can be called from (already done; as of now
only from the bpf_psi callback, which is safe), and
2) which kfuncs are available to bpf oom handlers (only those which do
not try to grab unsafe locks) - I'll double-check this in the next
version (see the sketch below).
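
The idea, using the existing kfunc registration machinery (the kfunc
name here is illustrative, not necessarily the final one): only kfuncs
vetted to take no unsafe locks get registered for the OOM struct ops,
so handlers simply cannot call anything else.

__bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
				     struct task_struct *task)
{
	/* Body omitted in this sketch; the idea is that it kills
	 * @task, sets oc->bpf_memory_freed on success, and takes no
	 * locks that an allocating (OOM'ing) context might hold. */
	return 0;
}

BTF_KFUNCS_START(bpf_oom_kfuncs)
BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_TRUSTED_ARGS | KF_SLEEPABLE)
BTF_KFUNCS_END(bpf_oom_kfuncs)

static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &bpf_oom_kfuncs,
};

/* Only this vetted set is visible to struct_ops programs. */
err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
				&bpf_oom_kfunc_set);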

Thank you!
