[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzZ2Fg+AmFA-K3YODE27br+e0-rLJwn0M5XEwfEHqpPKgQ@mail.gmail.com>
Date: Wed, 17 Sep 2025 15:16:15 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Tao Chen <chen.dylane@...ux.dev>
Cc: ast@...nel.org, daniel@...earbox.net, john.fastabend@...il.com,
andrii@...nel.org, martin.lau@...ux.dev, eddyz87@...il.com, song@...nel.org,
yonghong.song@...ux.dev, kpsingh@...nel.org, sdf@...ichev.me,
haoluo@...gle.com, jolsa@...nel.org, bpf@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
On Tue, Sep 9, 2025 at 9:32 AM Tao Chen <chen.dylane@...ux.dev> wrote:
>
> The stacktrace map can be easily full, which will lead to failure in
> obtaining the stack. In addition to increasing the size of the map,
> another solution is to delete the stack_id after looking it up from
This is not a "solution", really. Just another partially broken
workaround to an already broken STACKMAP map (and there is no fixing
it, IMO).
When a user is doing lookup_and_delete for that element, another BPF
program can reuse that slot, and user space will delete the stack that
is, effectively, still in use.
<rant>
I also just looked at the bpf_stackmap_copy() implementation, and even
lookup behavior appears broken. We temporarily remove the bucket while
copying, just to put it back a bit later. Meanwhile another BPF
program can put something into that bucket and we'll just silently
discard that new value later. That was a new one for me, but whatever,
as I said STACKMAP cannot be used reliably anyways.
</rant>
But let's stay constructive here. Some vague proposals below.
Really, STACKMAP should have used some form of refcounting and let
users put those refs, instead of just unconditionally removing the
element. I wonder if we can retrofit this and treat lookup/delete as
get/put instead? This would work well for a typical use pattern where
we send stack_id through ringbuf of some sort and user space fetches
stack trace by that ID. Each bpf_get_stackid() would be treated as
refcount bump, and each lookup_and_delete or just delete would be
treated as refcount put.
Also, it would be better for STACKMAP to be a proper hashmap and
handle collisions properly.
The above two changes would probably make STACKMAP actually usable as
"a stack trace bank" producing 4-byte IDs that are easily added to
fixed-sized ringbuf samples as an extra field. This model sometimes is
way more convenient than getting bpf_get_stack() and copying it into
ringbuf (which is currently the only way to have reliable stack traces
with BPF, IMO).
So, tl;dr. Maybe instead of pretending like we are fixing something
about STACKMAP with slightly-more-atomic (but not really)
lookup_and_delete support, maybe let's try to actually make STACKMAP
usable?.. (it's way harder than this patch, but has more value, IMO)
What does everyone think?
P.S. It seems like a good idea to switch STACKMAP to open addressing
instead of the current kind-of-bucket-chain-but-not-really
implementation. It's fixed size and pre-allocated already, so open
addressing seems like a great approach here, IMO.
> the user, so extend the existing bpf_map_lookup_and_delete_elem()
> functionality to stacktrace map types.
>
> Signed-off-by: Tao Chen <chen.dylane@...ux.dev>
> ---
> include/linux/bpf.h | 2 +-
> kernel/bpf/stackmap.c | 16 ++++++++++++++--
> kernel/bpf/syscall.c | 8 +++++---
> 3 files changed, 20 insertions(+), 6 deletions(-)
>
As for the patch in question, I think the logic is correct :) I find
bpf_stackmap_copy_and_delete() name bad and misleading, though,
because it's more of "maybe also delete". Maybe
bpf_stackmap_extract()? Don't know, it's minor nit anyways.
[...]
Powered by blists - more mailing lists