Message-ID: <CAP01T76AUkN_v425s5DjCyOg_xxFGQ=P1jGBDv6XkbL5wwetHA@mail.gmail.com>
Date: Wed, 20 Aug 2025 13:28:46 +0200
From: Kumar Kartikeya Dwivedi <memxor@...il.com>
To: Roman Gushchin <roman.gushchin@...ux.dev>
Cc: linux-mm@...ck.org, bpf@...r.kernel.org,
Suren Baghdasaryan <surenb@...gle.com>, Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...e.com>,
David Rientjes <rientjes@...gle.com>, Matt Bobrowski <mattbobrowski@...gle.com>,
Song Liu <song@...nel.org>, Alexei Starovoitov <ast@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
On Mon, 18 Aug 2025 at 19:01, Roman Gushchin <roman.gushchin@...ux.dev> wrote:
>
> Introduce a bpf struct ops for implementing custom OOM handling policies.
>
> The struct ops provides the bpf_handle_out_of_memory() callback,
> which is expected to return 1 if it was able to free some memory and 0
> otherwise.
>
> In the latter case it's guaranteed that the in-kernel OOM killer will
> be invoked. Otherwise the kernel also checks the bpf_memory_freed
> field of the oom_control structure, which is expected to be set by
> kfuncs suitable for releasing memory. It's a safety mechanism which
> prevents a bpf program from claiming forward progress without actually
> releasing memory. The callback program is sleepable to enable using
> iterators, e.g. cgroup iterators.
>
> The callback receives struct oom_control as an argument, so it can
> easily filter out OOM's it doesn't want to handle, e.g. global vs
> memcg OOM's.
>
> The callback is executed just before the kernel victim task selection
> algorithm, so all heuristics and sysctls like panic on oom and
> sysctl_oom_kill_allocating_task are respected.
>
> The struct ops also has the name field, which allows defining a
> custom name for the implemented policy. It's printed in the OOM report
> in the oom_policy=<policy> format. "default" is printed if bpf is not
> used or policy name is not specified.
>
> [ 112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> oom_policy=bpf_test_policy
> [ 112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
> [ 112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
> [ 112.698167] Call Trace:
> [ 112.698177] <TASK>
> [ 112.698182] dump_stack_lvl+0x4d/0x70
> [ 112.698192] dump_header+0x59/0x1c6
> [ 112.698199] oom_kill_process.cold+0x8/0xef
> [ 112.698206] bpf_oom_kill_process+0x59/0xb0
> [ 112.698216] bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
> [ 112.698229] bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
> [ 112.698236] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 112.698240] bpf_handle_oom+0x11a/0x1e0
> [ 112.698250] out_of_memory+0xab/0x5c0
> [ 112.698258] mem_cgroup_out_of_memory+0xbc/0x110
> [ 112.698274] try_charge_memcg+0x4b5/0x7e0
> [ 112.698288] charge_memcg+0x2f/0xc0
> [ 112.698293] __mem_cgroup_charge+0x30/0xc0
> [ 112.698299] do_anonymous_page+0x40f/0xa50
> [ 112.698311] __handle_mm_fault+0xbba/0x1140
> [ 112.698317] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 112.698335] handle_mm_fault+0xe6/0x370
> [ 112.698343] do_user_addr_fault+0x211/0x6a0
> [ 112.698354] exc_page_fault+0x75/0x1d0
> [ 112.698363] asm_exc_page_fault+0x26/0x30
> [ 112.698366] RIP: 0033:0x7fa97236db00
>
> It's possible to load multiple bpf struct ops programs. In the case of
> oom, they will be executed one by one in the same order they have been
> loaded until one of them returns 1 and bpf_memory_freed is set to 1
> - an indication that the memory was freed. This allows multiple bpf
> programs to focus on different types of OOM's - e.g. one program can
> handle only memcg OOM's in one memory cgroup.
> But the filtering is done in bpf - so it's fully flexible.
I think a natural question here is ordering. Is this ability to have
multiple OOM programs critical right now?
How is it decided who gets to run before the other? Is it based on
order of attachment (which can be non-deterministic)?

There was a lot of discussion on something similar for tc progs, and
we went with specific flags that capture partial ordering constraints
(instead of priorities that may collide).
https://lore.kernel.org/all/20230719140858.13224-2-daniel@iogearbox.net
It would be nice if we could find a way of making this consistent.

Another option is to exclude the multiple attachment bit from the
initial version and do this as a follow-up, since it probably requires
more discussion.
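
To make the ordering concern concrete, here is a rough userspace model
of the dispatch described in the quoted text. It is only a sketch under
stated assumptions, not code from the patch: struct oom_ctl_model,
struct oom_policy_model, run_policies() and the three handlers are
made-up names, and the only behaviour taken from the description is
"run handlers in load order until one returns 1 and bpf_memory_freed
is set, otherwise fall back to the default policy".

```c
#include <stdio.h>

/* Hypothetical stand-in for struct oom_control; illustration only. */
struct oom_ctl_model {
	int is_memcg_oom;     /* 1 = memcg OOM, 0 = global OOM */
	int bpf_memory_freed; /* set by whatever actually frees memory */
};

/* Hypothetical stand-in for one attached struct ops instance. */
struct oom_policy_model {
	const char *name;
	int (*handle)(struct oom_ctl_model *oc);
};

/* Handles memcg OOMs only and ignores global ones. */
static int handle_memcg_only(struct oom_ctl_model *oc)
{
	if (!oc->is_memcg_oom)
		return 0;
	oc->bpf_memory_freed = 1;
	return 1;
}

/* Handles every OOM it sees. */
static int handle_catch_all(struct oom_ctl_model *oc)
{
	oc->bpf_memory_freed = 1;
	return 1;
}

/* Claims progress but never frees anything. */
static int handle_liar(struct oom_ctl_model *oc)
{
	(void)oc;
	return 1;
}

/*
 * Walk the policies in array (= load) order. A policy "wins" only if
 * it returns 1 and bpf_memory_freed was set; otherwise the next one
 * runs. If nobody wins, the in-kernel killer ("default") takes over.
 */
static const char *run_policies(const struct oom_policy_model *p, int n,
				struct oom_ctl_model *oc)
{
	for (int i = 0; i < n; i++) {
		oc->bpf_memory_freed = 0;
		if (p[i].handle(oc) && oc->bpf_memory_freed)
			return p[i].name;
	}
	return "default";
}

/* Same two policies, two load orders, same memcg OOM. */
void oom_model_demo(void)
{
	const struct oom_policy_model a[] = {
		{ "memcg_only", handle_memcg_only },
		{ "catch_all", handle_catch_all },
	};
	const struct oom_policy_model b[] = {
		{ "catch_all", handle_catch_all },
		{ "memcg_only", handle_memcg_only },
	};
	const struct oom_policy_model c[] = {
		{ "liar", handle_liar },
	};
	struct oom_ctl_model oc = { .is_memcg_oom = 1 };

	printf("order a: %s\n", run_policies(a, 2, &oc));
	printf("order b: %s\n", run_policies(b, 2, &oc));
	printf("liar only: %s\n", run_policies(c, 1, &oc));
}
```

In this model the same memcg OOM is handled by "memcg_only" in order a
and by "catch_all" in order b, so the outcome depends entirely on which
program happened to be loaded first. The "liar" case falls through to
"default" because bpf_memory_freed was never set.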
>
> Signed-off-by: Roman Gushchin <roman.gushchin@...ux.dev>
> ---
> [...]