Message-ID: <CAEf4BzbtU2m9mh+Wi-BvuJ7U5_oHL3TWB8w2M5pRO6w6CCbfVw@mail.gmail.com>
Date: Tue, 21 Oct 2025 09:37:17 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Tao Chen <chen.dylane@...ux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@...il.com>, Jiri Olsa <olsajiri@...il.com>,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>, Kan Liang <kan.liang@...ux.intel.com>,
Song Liu <song@...nel.org>, Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, Eduard <eddyz87@...il.com>,
Yonghong Song <yonghong.song@...ux.dev>, John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
"linux-perf-use." <linux-perf-users@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
bpf <bpf@...r.kernel.org>
Subject: Re: [RFC PATCH bpf-next v2 2/2] bpf: Pass external callchain entry to get_perf_callchain
On Sat, Oct 18, 2025 at 12:51 AM Tao Chen <chen.dylane@...ux.dev> wrote:
>
> 在 2025/10/17 04:39, Andrii Nakryiko 写道:
> > On Tue, Oct 14, 2025 at 8:02 AM Alexei Starovoitov
> > <alexei.starovoitov@...il.com> wrote:
> >>
> >> On Tue, Oct 14, 2025 at 5:14 AM Jiri Olsa <olsajiri@...il.com> wrote:
> >>>
> >>> On Tue, Oct 14, 2025 at 06:01:28PM +0800, Tao Chen wrote:
> >>>> As Alexei noted, get_perf_callchain() return values may be reused
> >>>> if a task is preempted after the BPF program enters migrate disable
> >>>> mode. Drawing on the per-cpu design of bpf_perf_callchain_entries,
> >>>> stack-allocated memory of bpf_perf_callchain_entry is used here.
> >>>>
> >>>> Signed-off-by: Tao Chen <chen.dylane@...ux.dev>
> >>>> ---
> >>>> kernel/bpf/stackmap.c | 19 +++++++++++--------
> >>>> 1 file changed, 11 insertions(+), 8 deletions(-)
> >>>>
> >>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> >>>> index 94e46b7f340..acd72c021c0 100644
> >>>> --- a/kernel/bpf/stackmap.c
> >>>> +++ b/kernel/bpf/stackmap.c
> >>>> @@ -31,6 +31,11 @@ struct bpf_stack_map {
> >>>> struct stack_map_bucket *buckets[] __counted_by(n_buckets);
> >>>> };
> >>>>
> >>>> +struct bpf_perf_callchain_entry {
> >>>> + u64 nr;
> >>>> + u64 ip[PERF_MAX_STACK_DEPTH];
> >>>> +};
> >>>> +
> >
> > we shouldn't introduce another type, there is perf_callchain_entry in
> > linux/perf_event.h, what's the problem with using that?
>
> perf_callchain_entry uses a flexible array, and DEFINE_PER_CPU does
> not seem to allocate a buffer for it, so for ease of use the size of
> the ip array has been defined explicitly.
>
> struct perf_callchain_entry {
>         u64 nr;
>         u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */
> };
>
Ok, fair enough, but instead of casting between perf_callchain_entry
and bpf_perf_callchain_entry, why not put perf_callchain_entry inside
bpf_perf_callchain_entry as the first member and pass a pointer to it?
That seems a bit more appropriate, though I'm not sure whether the
compiler will complain about that flex array...
But on a related note, I looked briefly at how perf gets those
perf_callchain_entries, and it does seem like it also has a small
stack of entries, so maybe we don't really need to invent anything
here. See PERF_NR_CONTEXTS and how it's used.
If instead of disabling preemption we disable migration, then I think
we should be good with relying on perf's callchain management, or am I
missing something?
> >
> >>>> static inline bool stack_map_use_build_id(struct bpf_map *map)
> >>>> {
> >>>> return (map->map_flags & BPF_F_STACK_BUILD_ID);
> >>>> @@ -305,6 +310,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
> >>>> bool user = flags & BPF_F_USER_STACK;
> >>>> struct perf_callchain_entry *trace;
> >>>> bool kernel = !user;
> >>>> + struct bpf_perf_callchain_entry entry = { 0 };
> >>>
> >>> so IIUC having entries on stack we do not need to do preempt_disable
> >>> you had in the previous version, right?
> >>>
> >>> I saw Andrii's justification to have this on the stack, I think it's
> >>> fine, but does it have to be initialized? it seems that only used
> >>> entries are copied to map
> >>
> >> No. We're not adding 1k stack consumption.
> >
> > Right, and I thought we concluded as much last time, so it's a bit
> > surprising to see this in this patch.
> >
>
> Ok, I feel like I'm missing some context from our previous exchange.
>
> > Tao, you should go with 3 entries per CPU used in a stack-like
> > fashion. And then passing that entry into get_perf_callchain() (to
> > avoid one extra copy).
> >
>
> Got it. That is clearer, I will change it in v3.
>
> >>
> >> pw-bot: cr
>
>
> --
> Best Regards
> Tao Chen