Message-ID: <aBUoF8DyzqmiW5vk@google.com>
Date: Fri, 2 May 2025 13:16:23 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
Masami Hiramatsu <mhiramat@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Josh Poimboeuf <jpoimboe@...nel.org>, x86@...nel.org,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Indu Bhagat <indu.bhagat@...cle.com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
linux-perf-users@...r.kernel.org, Mark Brown <broonie@...nel.org>,
linux-toolchains@...r.kernel.org, Jordan Rome <jordalgo@...a.com>,
Sam James <sam@...too.org>,
Andrii Nakryiko <andrii.nakryiko@...il.com>,
Jens Remus <jremus@...ux.ibm.com>,
Florian Weimer <fweimer@...hat.com>,
Andy Lutomirski <luto@...nel.org>, Weinan Liu <wnliu@...gle.com>,
Blake Jones <blakejones@...gle.com>,
Beau Belgrave <beaub@...ux.microsoft.com>,
"Jose E. Marchesi" <jemarch@....org>,
Alexander Aring <aahringo@...hat.com>
Subject: Re: [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events
On Thu, May 01, 2025 at 04:57:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 May 2025 13:14:11 -0700
> Namhyung Kim <namhyung@...nel.org> wrote:
>
> > Hi Steve,
> >
> > On Wed, Apr 30, 2025 at 09:32:07PM -0400, Steven Rostedt wrote:
>
> > > To solve this, when a per CPU event is created with the defer_callchain
> > > attribute set, it looks up, in a global list (unwind_deferred_list), a
> > > perf_unwind_deferred descriptor whose id matches the PID of the current
> > > task's group_leader.
> >
> > Nice, it'd work well with the perf tools at least.
>
> Cool!
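
For illustration, the lookup described above would be something like the
sketch below (a sketch only; the helper name and the id/list field names
are my assumptions, not taken from the patch):

        /* Hypothetical sketch of the group_leader PID lookup */
        static struct perf_unwind_deferred *perf_find_unwind_deferred(void)
        {
                struct perf_unwind_deferred *defer;
                u64 id = (u64)task_pid_nr(current->group_leader);

                /* assumes the caller serializes additions/removals */
                list_for_each_entry(defer, &unwind_deferred_list, list) {
                        if (defer->id == id)
                                return defer;
                }
                return NULL;
        }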
>
>
>
> > > +static void perf_event_deferred_cpu(struct unwind_work *work,
> > > +                                    struct unwind_stacktrace *trace, u64 cookie)
> > > +{
> > > +        struct perf_unwind_deferred *defer =
> > > +                container_of(work, struct perf_unwind_deferred, unwind_work);
> > > +        struct perf_unwind_cpu *cpu_unwind;
> > > +        struct perf_event *event;
> > > +        int cpu;
> > > +
> > > +        guard(rcu)();
> > > +        guard(preempt)();
> > > +
> > > +        cpu = smp_processor_id();
> > > +        cpu_unwind = &defer->cpu_events[cpu];
> > > +
> > > +        WRITE_ONCE(cpu_unwind->processing, 1);
> > > +        /*
> > > +         * Make sure the above is seen for the rcuwait in
> > > +         * perf_remove_unwind_deferred() before iterating the loop.
> > > +         */
> > > +        smp_mb();
> > > +
> > > +        list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
> > > +                perf_event_callchain_deferred(event, trace);
> > > +                /* Only the first CPU event gets the trace */
> > > +                break;
> >
> > I guess this is to emit a single callchain record when more than one
> > event requested the deferred callchain for the same task, like:
> >
> > $ perf record -a -e cycles,instructions
> >
> > right?
>
> Yeah. If perf assigns more than one per CPU event, we only need one of
> those events to record the deferred trace, not both of them.
>
> But I keep a linked list so that if the program closes the first one and
> keeps the second active, this will still work, as the first one would be
> removed from the list, and the second one would pick up the tracing after
> that.
Makes sense.
>
> >
> >
> > > +        }
> > > +
> > > +        WRITE_ONCE(cpu_unwind->processing, 0);
> > > +        rcuwait_wake_up(&cpu_unwind->pending_unwind_wait);
> > > +}
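
(For reference, the wait side that the smp_mb() and "processing" flag
above pair with would look roughly like the sketch below. The function
body and signature here are my assumptions based on the comment in the
code, not the actual patch.)

        static void perf_remove_unwind_deferred(struct perf_event *event,
                                                struct perf_unwind_cpu *cpu_unwind)
        {
                list_del_rcu(&event->unwind_list);

                /*
                 * Pairs with the smp_mb() after WRITE_ONCE(processing, 1):
                 * wait until a concurrent callback is done with the event.
                 */
                rcuwait_wait_event(&cpu_unwind->pending_unwind_wait,
                                   !READ_ONCE(cpu_unwind->processing),
                                   TASK_UNINTERRUPTIBLE);
        }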
> > > +
> > >  static void perf_free_addr_filters(struct perf_event *event);
> > >
> > >  /* vs perf_event_alloc() error */
> > > @@ -8198,6 +8355,15 @@ static int deferred_request_nmi(struct perf_event *event)
> > >          return 0;
> > >  }
> > >
> > > +static int deferred_unwind_request(struct perf_unwind_deferred *defer)
> > > +{
> > > +        u64 cookie;
> > > +        int ret;
> > > +
> > > +        ret = unwind_deferred_request(&defer->unwind_work, &cookie);
> > > +        return ret < 0 ? ret : 0;
> > > +}
> > > +
> > >  /*
> > >   * Returns:
> > >   *  > 0 : if already queued.
> > > @@ -8210,11 +8376,14 @@ static int deferred_request(struct perf_event *event)
> > >          int pending;
> > >          int ret;
> > >
> > > -        /* Only defer for task events */
> > > -        if (!event->ctx->task)
> > > +        if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > >                  return -EINVAL;
> > >
> > > -        if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > > +        if (event->unwind_deferred)
> > > +                return deferred_unwind_request(event->unwind_deferred);
> > > +
> > > +        /* Per CPU events should have had unwind_deferred set! */
> > > +        if (WARN_ON_ONCE(!event->ctx->task))
> > >                  return -EINVAL;
> > >
> > >          if (in_nmi())
> > > @@ -13100,13 +13269,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> > >                  }
> > >          }
> > >
> > > +        /* Setup unwind deferring for per CPU events */
> > > +        if (event->attr.defer_callchain && !task) {
> >
> > As I said it should handle per-task and per-CPU events. How about this?
>
> Hmm, I just added some printk()s in this code, and it seems that perf
> record always uses per CPU events.
Right, that's the default behavior.
>
> But if an event is per CPU and per task, will it still only trace that
> task? It will never trace another task, right?
Yes, the event can be inherited by a child, but then the child will create
a new event, so each task will have its own events.
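
For reference, these are the standard perf_event_open() pid/cpu
combinations at play here (man page semantics, nothing new):

        /*
         * perf_event_open(attr, pid, cpu, group_fd, flags):
         *   pid == -1, cpu >= 0  : all tasks, but only on that CPU
         *   pid >= 0,  cpu >= 0  : that task, only while on that CPU
         *   pid >= 0,  cpu == -1 : that task, on any CPU
         */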
>
> Because of the way this is currently implemented, the event that
> requested the callback is the one that records it, even if the
> callback runs on another CPU:
>
> In deferred_request_nmi():
>
>         struct callback_head *work = &event->pending_unwind_work;
>         int ret;
>
>         if (event->pending_unwind_callback)
>                 return 1;
>
>         ret = task_work_add(current, work, TWA_NMI_CURRENT);
>         if (ret)
>                 return ret;
>
>         event->pending_unwind_callback = 1;
>
> The task_work_add() call queues the work embedded in the event's
> pending_unwind_work.
>
> Now the callback will be:
>
> static void perf_event_deferred_task(struct callback_head *work)
> {
>         struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
>
>         // The above is the event that requested this. This may run on another CPU.
>
>         struct unwind_stacktrace trace;
>
>         if (!event->pending_unwind_callback)
>                 return;
>
>         if (unwind_deferred_trace(&trace) >= 0) {
>
>                 /*
>                  * All accesses to the event must belong to the same implicit RCU
>                  * read-side critical section as the ->pending_unwind_callback reset.
>                  * See comment in perf_pending_unwind_sync().
>                  */
>                 guard(rcu)();
>                 perf_event_callchain_deferred(event, &trace);
>
>                 // The above records the stack trace to that event.
>                 // Again, this may happen on another CPU.
>
>         }
>
>         event->pending_unwind_callback = 0;
>         local_dec(&event->ctx->nr_no_switch_fast);
>         rcuwait_wake_up(&event->pending_unwind_wait);
> }
>
> Is recording to an event from a different CPU an issue, if that
> event is also only tracing a task?
IIUC it should be fine as long as you use the unwind descriptor logic
as in the per-CPU case. The data should be written to the current
CPU's ring buffer for both per-task and per-CPU events.
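
To illustrate, emitting the deferred record goes through the usual
output path, which reserves space in the buffer backing the event at
write time. A rough sketch (the record type and size layout follow
this series as I understand it, so treat those details as assumptions):

        struct perf_output_handle handle;
        struct perf_sample_data data;
        struct perf_event_header header = {
                .type = PERF_RECORD_CALLCHAIN_DEFERRED,
                /* header + cookie + nr + entries (layout assumed) */
                .size = sizeof(header) + (trace->nr + 2) * sizeof(u64),
        };

        if (perf_output_begin(&handle, &data, event, header.size))
                return;
        perf_output_put(&handle, header);
        /* ... write the cookie and trace->entries here ... */
        perf_output_end(&handle);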
>
> >
> >         if (event->attr.defer_callchain) {
> >                 if (event->cpu >= 0) {
> >                         err = perf_add_unwind_deferred(event);
> >                         if (err)
> >                                 return ERR_PTR(err);
> >                 } else {
> >                         init_task_work(&event->pending_unwind_work,
> >                                        perf_event_deferred_task);
> >                 }
> >         }
> >
> > > +                err = perf_add_unwind_deferred(event);
> > > +                if (err)
> > > +                        return ERR_PTR(err);
> > > +        }
> > > +
> > >          err = security_perf_event_alloc(event);
> > >          if (err)
> > >                  return ERR_PTR(err);
> > >
> > >          if (event->attr.defer_callchain)
> > >                  init_task_work(&event->pending_unwind_work,
> > > -                               perf_event_callchain_deferred);
> > > +                               perf_event_deferred_task);
> >
> > And you can remove here.
>
> There's nothing wrong with always initializing it. It will just never be
> called.
Ok.
>
> What situation do we have where cpu is negative? What's the perf command?
> Is there one?
Yep, there's the --per-thread option for just per-task events.
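
For example (./myprog is just a placeholder workload):

  $ perf record --per-thread -e cycles -- ./myprog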
Thanks,
Namhyung