Message-ID: <4fd39e4b-f2dc-4b7d-a3be-ec3eae8d592a@igalia.com>
Date: Fri, 14 Feb 2025 18:23:55 +0900
From: Changwoo Min <changwoo@...lia.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Eddy Z <eddyz87@...il.com>,
Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>,
John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
Jiri Olsa <jolsa@...nel.org>, Tejun Heo <tj@...nel.org>,
Andrea Righi <arighi@...dia.com>, kernel-dev@...lia.com,
bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next] bpf: Add a retry after refilling the free list
when unit_alloc() fails
Hello Alexei,
Thank you for the comments! I reordered your comments for ease of
explanation.
On 25. 2. 14. 02:45, Alexei Starovoitov wrote:
> On Wed, Feb 12, 2025 at 12:49 AM Changwoo Min <changwoo@...lia.com> wrote:
> The commit log is too terse to understand what exactly is going on.
> Pls share the call stack. What is the allocation size?
> How many do you do in a sequence?
The symptom is that an scx scheduler (scx_lavd) fails to load on
an ARM64 platform on its first try. The second try succeeds. In
the failure case, the kernel emits the following messages:
[ 27.431380] sched_ext: BPF scheduler "lavd" disabled (runtime error)
[ 27.431396] sched_ext: lavd: ops.init() failed (-12)
[ 27.431401] scx_ops_enable.isra.0+0x838/0xe48
[ 27.431413] bpf_scx_reg+0x18/0x30
[ 27.431418] bpf_struct_ops_link_create+0x144/0x1a0
[ 27.431427] __sys_bpf+0x1560/0x1f98
[ 27.431433] __arm64_sys_bpf+0x2c/0x80
[ 27.431439] do_el0_svc+0x74/0x120
[ 27.431446] el0_svc+0x80/0xb0
[ 27.431454] el0t_64_sync_handler+0x120/0x138
[ 27.431460] el0t_64_sync+0x174/0x178
The ops.init() failed because the 5th bpf_cpumask_create() call
failed during the initialization of the BPF scheduler. The exact
point where bpf_cpumask_create() failed is here [1]. That scx
scheduler allocates 5 CPU masks to aid its scheduling decisions.
Also, it seems that there is no graceful way to handle the
allocation failure since it happens during the initialization of
the scx scheduler.
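
For context, the allocation pattern in the scheduler's ops.init()
boils down to something like the sketch below. It is a simplified
illustration with made-up map and function names (and it assumes the
errno constants and kfunc declarations come from the usual scx/BPF
headers), not the actual scx_lavd code; see [1] for the real thing:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  /* kfunc declarations normally provided by the common scx headers */
  struct bpf_cpumask *bpf_cpumask_create(void) __ksym;
  void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym;

  struct cpumask_box {
          struct bpf_cpumask __kptr *mask;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 5);
          __type(key, u32);
          __type(value, struct cpumask_box);
  } cpumask_map SEC(".maps");

  /* Called from ops.init(): create one cpumask per scheduling domain
   * and stash it into the map for later use.
   */
  static int init_cpumasks(void)
  {
          struct bpf_cpumask *mask, *old;
          struct cpumask_box *box;
          u32 i;

          for (i = 0; i < 5; i++) {
                  box = bpf_map_lookup_elem(&cpumask_map, &i);
                  if (!box)
                          return -ENOENT;

                  mask = bpf_cpumask_create();
                  if (!mask)
                          return -ENOMEM; /* fails on the 5th call on the 1st load */

                  old = bpf_kptr_xchg(&box->mask, mask);
                  if (old)
                          bpf_cpumask_release(old);
          }
          return 0;
  }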
Digging into the code, bpf_cpumask_create() relies on
bpf_mem_cache_alloc(), and bpf_mem_alloc_init() prefills only
4 entries per CPU (prefill_mem_cache), so the 5th allocation of
the cpumask failed.
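
For reference, the prefill path I am referring to looks roughly like
this (paraphrased from kernel/bpf/memalloc.c from memory, so not
necessarily a verbatim quote of the current tree):

  static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
  {
          /* The first run of a BPF program is assumed to do only a
           * handful of allocations from a restricted context, so only
           * 4 entries (1 for large unit sizes) are pre-allocated per CPU.
           */
          alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false);
  }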
Increasing the number of prefilled entries would be a solution,
but that would cause unnecessary memory overhead in other cases,
so I avoided that approach.
> But we may do something.
> Draining free_by_rcu_ttrace and waiting_for_gp_ttrace can be done,
> but will it address your case?
Unfortunately, harvesting free_by_rcu_ttrace and
waiting_for_gp_ttrace does not help (I tested it). In my case,
the memory allocation fails when loading an scx scheduler, so
free_by_rcu_ttrace and waiting_for_gp_ttrace are empty, and there
is nothing to harvest.
> Why irq-s are disabled? Isn't this for scx ?
In this particular scenario, IRQs are not disabled. I just
meant that such an allocation failure can easily happen with
excessive allocations while IRQs are disabled.
>> (e.g., bpf_cpumask_create), allocate the additional free entry in an atomic
>> manner (atomic = true in alloc_bulk).
>
> ...
>> + if (unlikely(!llnode && !retry)) {
>> + int cpu = smp_processor_id();
>> + alloc_bulk(c, 1, cpu_to_node(cpu), true);
>
> This is broken.
> Passing atomic doesn't help.
> unit_alloc() can be called from any context
> including NMI/IRQ/kprobe deeply nested in slab internals.
> kmalloc() is not safe from there.
> The whole point of bpf_mem_alloc() is to be safe from
> unknown context. If we could do kmalloc(GFP_NOWAIT)
> everywhere bpf_mem_alloc() would be needed.
I didn't think about the NMI case, where GFP_NOWAIT and GFP_ATOMIC are
not safe.
Hmm... maybe we can extend 'bpf_mem_alloc_init()' or 'struct
bpf_mem_alloc' to specify the (initial) prefill count. That way,
we could set a slightly larger prefill count (say, 8) for the
bpf cpumask allocator. What do you think?
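
To make it a bit more concrete, something along these lines; this is
only a rough sketch, the function name and the extra parameter are
placeholders rather than an existing API, and the cpumask init side
is paraphrased:

  /* Placeholder prototype -- lets a caller request a larger per-CPU
   * prefill than the default when it knows its first allocations
   * happen in a restricted context (e.g. ops.init() of an scx
   * scheduler).
   */
  int bpf_mem_alloc_init_prefill(struct bpf_mem_alloc *ma, int size,
                                 bool percpu, int prefill_cnt);

  /* Then kernel/bpf/cpumask.c could ask for 8 entries per CPU instead
   * of the default 4:
   */
  static int __init cpumask_kfunc_init(void)
  {
          int ret;

          ret = bpf_mem_alloc_init_prefill(&bpf_cpumask_ma,
                                           sizeof(struct bpf_cpumask),
                                           false, 8);
          if (ret)
                  return ret;

          /* ... existing kfunc registration unchanged ... */
          return 0;
  }

The default prefill would stay as it is today, so other users would
not pay any extra memory cost.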
[1]
https://github.com/sched-ext/scx/blob/f17985cac0a60ba0136bbafa3f546db2b966cec0/scheds/rust/scx_lavd/src/bpf/main.bpf.c#L1970
Regards,
Changwoo Min