linux-kernel - Re: [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for BPF_MAP_STACK

Message-ID: <CAADnVQ+s8B7-fvR1TNO-bniSyKv57cH_ihRszmZV7pQDyV=VDQ@mail.gmail.com>
Date: Wed, 17 Sep 2025 18:35:31 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Tao Chen <chen.dylane@...ux.dev>, Alexei Starovoitov <ast@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>, 
	Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, Eduard <eddyz87@...il.com>, 
	Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>, KP Singh <kpsingh@...nel.org>, 
	Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>, 
	bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE

On Wed, Sep 17, 2025 at 3:16 PM Andrii Nakryiko
<andrii.nakryiko@...il.com> wrote:
>
>
> P.S. It seems like a good idea to switch STACKMAP to open addressing
> instead of the current kind-of-bucket-chain-but-not-really
> implementation. It's fixed size and pre-allocated already, so open
> addressing seems like a great approach here, IMO.

That makes sense. It won't have backward compat issues,
just a more reliable stack_id.
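
To make the open addressing idea concrete, here is a minimal user-space
sketch, not kernel code. The names (struct slot, hash_ips, stack_id_get),
the table size, and the FNV-1a hash are all assumptions made for
illustration. A colliding trace probes forward to the next free slot
instead of failing or overwriting, so an id keeps pointing at the same
trace. Deletion (which would need tombstones) is left out.

```c
#include <stdint.h>
#include <string.h>

#define NSLOTS    8   /* fixed at creation time; must be a power of two */
#define MAX_DEPTH 4   /* illustrative cap on frames per trace */

struct slot {
	uint32_t used;
	uint32_t hash;
	uint32_t nr;
	uint64_t ip[MAX_DEPTH];
};

static struct slot table[NSLOTS];   /* pre-allocated, never resized */

/* FNV-1a over the raw bytes of the trace; any decent hash would do. */
static uint32_t hash_ips(const uint64_t *ip, uint32_t nr)
{
	const unsigned char *p = (const unsigned char *)ip;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < nr * sizeof(*ip); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Returns the slot index (the "stack id") for this trace, inserting it
 * if it is new. Returns -1 if the trace is too deep or the table is full.
 */
static int stack_id_get(const uint64_t *ip, uint32_t nr)
{
	uint32_t h, i;

	if (nr > MAX_DEPTH)
		return -1;
	h = hash_ips(ip, nr);

	/* Linear probing: walk forward from the home slot, wrapping around. */
	for (i = 0; i < NSLOTS; i++) {
		struct slot *s = &table[(h + i) & (NSLOTS - 1)];

		if (!s->used) {
			s->used = 1;
			s->hash = h;
			s->nr = nr;
			memcpy(s->ip, ip, nr * sizeof(*ip));
			return (int)(s - table);
		}
		if (s->hash == h && s->nr == nr &&
		    memcmp(s->ip, ip, nr * sizeof(*ip)) == 0)
			return (int)(s - table);
	}
	return -1;
}
```

Asking for the same trace twice returns the same id, and two different
traces never share an id even when their hashes land on the same home slot.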

Fixed value_size is another footgun there.
Especially for collecting user stack traces.
We can switch the whole stackmap to bpf_mem_alloc()
or wait for kmalloc_nolock().
But it's probably a diminishing return.

bpf_get_stack() also isn't great: one copy into
perf_callchain_entry, then a 2nd copy into an on-stack/percpu buf/ringbuf,
and an (optional) 3rd copy of the correct size into a ringbuf.

Also, I just realized we have another nasty race there.
In the past bpf progs were run in preempt disabled context,
but we forgot to adjust bpf_get_stack[id]() helpers when everything
switched to migrate disable.

The return value from get_perf_callchain() may be reused
if another task preempts and requests the stack.
We have a partially incorrect comment in __bpf_get_stack() too:
        if (may_fault)
                rcu_read_lock(); /* need RCU for perf's callchain below */

RCU can be preemptible, so rcu_read_lock() keeps
trace = get_perf_callchain(...)
accessible, but that per-cpu trace buffer can still be overwritten.
It's not an issue for CONFIG_PREEMPT_NONE=y, but that doesn't
give much comfort.
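
The reuse problem can be modeled in a few lines of single-threaded
user-space C. This is only an illustration: fake_get_callchain() and
percpu_entry are made-up stand-ins for get_perf_callchain() and one CPU's
callchain buffer, and the "preemption" is simulated by simply calling the
function a second time before the first caller reads its result.

```c
#include <stdint.h>

#define DEPTH 4

struct entry {
	uint32_t nr;
	uint64_t ip[DEPTH];
};

/* Stand-in for the single callchain buffer that one CPU hands out. */
static struct entry percpu_entry;

/* Hypothetical stand-in: fills the shared buffer and returns a pointer to it. */
static struct entry *fake_get_callchain(uint64_t base)
{
	percpu_entry.nr = DEPTH;
	for (uint32_t i = 0; i < DEPTH; i++)
		percpu_entry.ip[i] = base + i;
	return &percpu_entry;
}

/* Returns the first ip that "task A" ends up reading from its trace. */
static uint64_t racy_read(void)
{
	/* Task A collects its stack and holds on to the pointer. */
	struct entry *trace = fake_get_callchain(0x1000);

	/* Task B preempts A on the same CPU and collects its own stack. */
	fake_get_callchain(0x2000);

	/* Task A resumes: the pointer is still valid, the contents are B's. */
	return trace->ip[0];
}
```

Here racy_read() returns 0x2000 rather than 0x1000: the pointer stays
valid the whole time, which is all the RCU read lock guarantees, but the
data behind it belongs to the other task.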

A modern-day bpf api would probably be:
- get_callchain_entry()/put() kfuncs to expose the low-level mechanism
with safe acq/rel of the temp buffer.
- then other kfuncs to run perf_callchain_kernel/user into that buffer.

and with bpf_mem_alloc and hash kfuncs the bpf prog can
implement either bpf_get_stack() equivalent or much better
bpf_get_stackid() with variable length stack traces and so on.
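
A rough user-space model of that shape, to show the acquire/release
discipline rather than any real kfunc signature. Everything here is
hypothetical: cc_entry_get()/cc_entry_put() stand in for the proposed
get/put kfuncs, cc_fill() stands in for the unwinder, and malloc() stands
in for bpf_mem_alloc. The point is that the temp buffer is owned
exclusively between get and put, and the saved copy is sized to the
actual trace length.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUF      2   /* illustrative number of temp buffers */
#define MAX_DEPTH 8

struct cc_entry {
	int      busy;
	uint32_t nr;
	uint64_t ip[MAX_DEPTH];
};

static struct cc_entry pool[NBUF];

/* Acquire a temp buffer, or NULL if all of them are in use. */
static struct cc_entry *cc_entry_get(void)
{
	for (int i = 0; i < NBUF; i++) {
		if (!pool[i].busy) {
			pool[i].busy = 1;
			pool[i].nr = 0;
			return &pool[i];
		}
	}
	return NULL;
}

/* Release a buffer previously returned by cc_entry_get(). */
static void cc_entry_put(struct cc_entry *e)
{
	e->busy = 0;
}

/* Stand-in for the unwinder: writes `depth` fake frames into the buffer. */
static void cc_fill(struct cc_entry *e, uint64_t base, uint32_t depth)
{
	if (depth > MAX_DEPTH)
		depth = MAX_DEPTH;
	for (uint32_t i = 0; i < depth; i++)
		e->ip[i] = base + i;
	e->nr = depth;
}

/* Variable-length copy: only as large as the trace actually is. */
struct stack_copy {
	uint32_t nr;
	uint64_t ip[];
};

static struct stack_copy *capture_stack(uint64_t base, uint32_t depth)
{
	struct cc_entry *e = cc_entry_get();
	struct stack_copy *c;

	if (!e)
		return NULL;   /* no free temp buffer; caller must cope */

	cc_fill(e, base, depth);

	c = malloc(sizeof(*c) + e->nr * sizeof(c->ip[0]));
	if (c) {
		c->nr = e->nr;
		memcpy(c->ip, e->ip, e->nr * sizeof(c->ip[0]));
	}

	cc_entry_put(e);       /* released on every path after a successful get */
	return c;
}
```

With this split, a second user that arrives while a buffer is held either
gets a different buffer or a clean failure, never someone else's
half-written trace.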