[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2d9320eb-6be9-43e4-a63e-08e7ab1427e3@linux.dev>
Date: Thu, 18 Sep 2025 20:45:23 +0800
From: Tao Chen <chen.dylane@...ux.dev>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: ast@...nel.org, daniel@...earbox.net, john.fastabend@...il.com,
andrii@...nel.org, martin.lau@...ux.dev, eddyz87@...il.com, song@...nel.org,
yonghong.song@...ux.dev, kpsingh@...nel.org, sdf@...ichev.me,
haoluo@...gle.com, jolsa@...nel.org, bpf@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for
BPF_MAP_STACK_TRACE
在 2025/9/18 06:16, Andrii Nakryiko 写道:
> On Tue, Sep 9, 2025 at 9:32 AM Tao Chen <chen.dylane@...ux.dev> wrote:
>>
>> The stacktrace map can be easily full, which will lead to failure in
>> obtaining the stack. In addition to increasing the size of the map,
>> another solution is to delete the stack_id after looking it up from
>
> This is not a "solution", really. Just another partially broken
> workaround to an already broken STACKMAP map (and there is no fixing
> it, IMO).
>
Actually it is. But in our use case, we used continuous profiling with
perf_event, the result looks better when we got the stack_id and deleted
it to alleviate the data size pressure in the map. And there is no high
requirement for the accuracy of stack information for us.
> When a user is doing lookup_and_delete for that element, another BPF
> program can reuse that slot, and user space will delete the stack that
> is, effectively, still in use.
>
> <rant>
>
> I also just looked at the bpf_stackmap_copy() implementation, and even
> lookup behavior appears broken. We temporarily remove the bucket while
> copying, just to put it back a bit later. Meanwhile another BPF
> program can put something into that bucket and we'll just silently
> discard that new value later. That was a new one for me, but whatever,
Yes, it is a problem.
> as I said STACKMAP cannot be used reliably anyways.
>
> </rant>
>
> But let's stay constructive here. Some vague proposals below.
>
> Really, STACKMAP should have used some form of refcounting and let
> users put those refs, instead of just unconditionally removing the
> element. I wonder if we can retrofit this and treat lookup/delete as
> get/put instead? This would work well for a typical use pattern where
> we send stack_id through ringbuf of some sort and user space fetches
> stack trace by that ID. Each bpf_get_stackid() would be treated as
> refcount bump, and each lookup_and_delete or just delete would be
> treated as refcount put.
>
> Also, it would be better for STACKMAP to be a proper hashmap and
> handle collisions properly.
>
> The above two changes would probably make STACKMAP actually usable as
> "a stack trace bank" producing 4-byte IDs that are easily added to
> fixed-sized ringbuf samples as an extra field. This model sometimes is
> way more convenient than getting bpf_get_stack() and copying it into
> ringbuf (which is currently the only way to have reliable stack traces
> with BPF, IMO).
>
> So, tl;dr. Maybe instead of pretending like we are fixing something
> about STACKMAP with slightly-more-atomic (but not really)
> lookup_and_delete support, maybe let's try to actually make STACKMAP
> usable?.. (it's way harder than this patch, but has more value, IMO)
>
The idea looks great. I will try to make improvements in this direction,
though there will be certain challenges for me right now.
> What does everyone think?
>
> P.S. It seems like a good idea to switch STACKMAP to open addressing
> instead of the current kind-of-bucket-chain-but-not-really
> implementation. It's fixed size and pre-allocated already, so open
> addressing seems like a great approach here, IMO.
>
>> the user, so extend the existing bpf_map_lookup_and_delete_elem()
>> functionality to stacktrace map types.
>>
>> Signed-off-by: Tao Chen <chen.dylane@...ux.dev>
>> ---
>> include/linux/bpf.h | 2 +-
>> kernel/bpf/stackmap.c | 16 ++++++++++++++--
>> kernel/bpf/syscall.c | 8 +++++---
>> 3 files changed, 20 insertions(+), 6 deletions(-)
>>
>
> As for the patch in question, I think the logic is correct :) I find
> bpf_stackmap_copy_and_delete() name bad and misleading, though,
> because it's more of "maybe also delete". Maybe
> bpf_stackmap_extract()? Don't know, it's minor nit anyways.
>
> [...]Will rename it in v2. The original idea in this patch was just to
make it convenient for users to delete the stackid when they obtain it.
--
Best Regards
Tao Chen
Powered by blists - more mailing lists