[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20260130113150.GB166857@noisy.programming.kicks-ass.net>
Date: Fri, 30 Jan 2026 12:31:50 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Tao Chen <chen.dylane@...ux.dev>, mingo@...hat.com, acme@...nel.org,
namhyung@...nel.org, mark.rutland@....com,
alexander.shishkin@...ux.intel.com, jolsa@...nel.org,
irogers@...gle.com, adrian.hunter@...el.com,
kan.liang@...ux.intel.com, song@...nel.org, ast@...nel.org,
daniel@...earbox.net, andrii@...nel.org, martin.lau@...ux.dev,
eddyz87@...il.com, yonghong.song@...ux.dev,
john.fastabend@...il.com, kpsingh@...nel.org, sdf@...ichev.me,
haoluo@...gle.com, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, bpf@...r.kernel.org
Subject: Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain
On Wed, Jan 28, 2026 at 11:12:09AM -0800, Andrii Nakryiko wrote:
> On Wed, Jan 28, 2026 at 1:10 AM Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote:
> > > From BPF stack map, we want to ensure that the callchain buffer
> > > will not be overwritten by other preemptive tasks and we also aim
> > > to reduce the preempt disable interval, Based on the suggestions from Peter
> > > and Andrrii, export new API __get_perf_callchain and the usage scenarios
> > > are as follows from BPF side:
> > >
> > > preempt_disable()
> > > entry = get_callchain_entry()
> > > preempt_enable()
> > > __get_perf_callchain(entry)
> > > put_callchain_entry(entry)
> >
> > That makes no sense, this means any other task on that CPU is getting
> > screwed over.
>
> Yes, unfortunately, but unless we dynamically allocate new entry each
> time and/or keep per-current entry cached there isn't much choice we
> have here, no?
>
> Maybe that's what we have to do, honestly, because
> get_perf_callchain() usage we have right now from sleepable BPF isn't
> great no matter with or without changes in this patch set.
All of the perf stuff is based on the fact that if we do it from IRQ/NMI
context, we can certainly do it with preemption disabled.
Bending the interface this way is just horrible.
> > Why are you worried about the preempt_disable() here? If this were an
> > interrupt context we'd still do that unwind -- but then with IRQs
> > disabled.
>
> Because __bpf_get_stack() from kernel/bpf/stackmap.c can be called
> from sleepable/faultable context and also we can do a rather expensive
> build ID resolution (either in sleepable or not, which only changes if
> build ID parsing logic waits for file backed pages to be paged in or
> not).
<rant>
So stack_map_get_build_id_offset() is a piece of crap -- and I've always
said it was. And I hate that any of that ever got merged -- its the
pinnacle of bad engineering and simply refusing to do the right thing in
the interest of hack now, fix never :/
</rant>
Anyway, as you well know, we now *do* have lockless vma lookups and
should be able to do this buildid thing much saner.
Also, there appears to be no buildid caching what so ever, surely that
would help some.
(and I'm not sure I've ever understood why the buildid crap needs to be
in this path in any case)
Powered by blists - more mailing lists