Message-ID: <aBUoF8DyzqmiW5vk@google.com>
Date: Fri, 2 May 2025 13:16:23 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
Masami Hiramatsu <mhiramat@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Josh Poimboeuf <jpoimboe@...nel.org>, x86@...nel.org,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Indu Bhagat <indu.bhagat@...cle.com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
linux-perf-users@...r.kernel.org, Mark Brown <broonie@...nel.org>,
linux-toolchains@...r.kernel.org, Jordan Rome <jordalgo@...a.com>,
Sam James <sam@...too.org>,
Andrii Nakryiko <andrii.nakryiko@...il.com>,
Jens Remus <jremus@...ux.ibm.com>,
Florian Weimer <fweimer@...hat.com>,
Andy Lutomirski <luto@...nel.org>, Weinan Liu <wnliu@...gle.com>,
Blake Jones <blakejones@...gle.com>,
Beau Belgrave <beaub@...ux.microsoft.com>,
"Jose E. Marchesi" <jemarch@....org>,
Alexander Aring <aahringo@...hat.com>
Subject: Re: [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events
On Thu, May 01, 2025 at 04:57:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 May 2025 13:14:11 -0700
> Namhyung Kim <namhyung@...nel.org> wrote:
>
> > Hi Steve,
> >
> > On Wed, Apr 30, 2025 at 09:32:07PM -0400, Steven Rostedt wrote:
>
> > > To solve this, when a per CPU event is created with the defer_callchain
> > > attribute set, it looks up, in a global list (unwind_deferred_list), a
> > > perf_unwind_deferred descriptor whose id matches the PID of the current
> > > task's group_leader.
> >
> > Nice, it'd work well with the perf tools at least.
>
> Cool!
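
For illustration, the lookup described above would be something like the
sketch below (a sketch only; the helper name and the id/list field names
are my assumptions, not taken from the patch):

        /* Hypothetical sketch of the group_leader PID lookup */
        static struct perf_unwind_deferred *perf_find_unwind_deferred(void)
        {
                struct perf_unwind_deferred *defer;
                u64 id = (u64)task_pid_nr(current->group_leader);

                /* assumes the caller serializes additions/removals */
                list_for_each_entry(defer, &unwind_deferred_list, list) {
                        if (defer->id == id)
                                return defer;
                }
                return NULL;
        }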
>
>
>
> > > +static void perf_event_deferred_cpu(struct unwind_work *work,
> > > +                                    struct unwind_stacktrace *trace, u64 cookie)
> > > +{
> > > +        struct perf_unwind_deferred *defer =
> > > +                container_of(work, struct perf_unwind_deferred, unwind_work);
> > > +        struct perf_unwind_cpu *cpu_unwind;
> > > +        struct perf_event *event;
> > > +        int cpu;
> > > +
> > > +        guard(rcu)();
> > > +        guard(preempt)();
> > > +
> > > +        cpu = smp_processor_id();
> > > +        cpu_unwind = &defer->cpu_events[cpu];
> > > +
> > > +        WRITE_ONCE(cpu_unwind->processing, 1);
> > > +        /*
> > > +         * Make sure the above is seen for the rcuwait in
> > > +         * perf_remove_unwind_deferred() before iterating the loop.
> > > +         */
> > > +        smp_mb();
> > > +
> > > +        list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
> > > +                perf_event_callchain_deferred(event, trace);
> > > +                /* Only the first CPU event gets the trace */
> > > +                break;
> >
> > I guess this is to emit a single callchain record when more than one
> > event requested the deferred callchain for the same task, like:
> >
> > $ perf record -a -e cycles,instructions
> >
> > right?
>
> Yeah. If perf assigns more than one per CPU event, we only need one of
> those events to record the deferred trace, not both of them.
>
> But I keep a linked list so that if the program closes the first one and
> keeps the second active, this will still work, as the first one would be
> removed from the list, and the second one would pick up the tracing after
> that.
Makes sense.
>
> >
> >
> > > +        }
> > > +
> > > +        WRITE_ONCE(cpu_unwind->processing, 0);
> > > +        rcuwait_wake_up(&cpu_unwind->pending_unwind_wait);
> > > +}
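
(For reference, the wait side that the smp_mb() and "processing" flag
above pair with would look roughly like the sketch below. The function
body and signature here are my assumptions based on the comment in the
code, not the actual patch.)

        static void perf_remove_unwind_deferred(struct perf_event *event,
                                                struct perf_unwind_cpu *cpu_unwind)
        {
                list_del_rcu(&event->unwind_list);

                /*
                 * Pairs with the smp_mb() after WRITE_ONCE(processing, 1):
                 * wait until a concurrent callback is done with the event.
                 */
                rcuwait_wait_event(&cpu_unwind->pending_unwind_wait,
                                   !READ_ONCE(cpu_unwind->processing),
                                   TASK_UNINTERRUPTIBLE);
        }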
> > > +
> > >  static void perf_free_addr_filters(struct perf_event *event);
> > >
> > >  /* vs perf_event_alloc() error */
> > > @@ -8198,6 +8355,15 @@ static int deferred_request_nmi(struct perf_event *event)
> > >          return 0;
> > >  }
> > >
> > > +static int deferred_unwind_request(struct perf_unwind_deferred *defer)
> > > +{
> > > +        u64 cookie;
> > > +        int ret;
> > > +
> > > +        ret = unwind_deferred_request(&defer->unwind_work, &cookie);
> > > +        return ret < 0 ? ret : 0;
> > > +}
> > > +
> > >  /*
> > >   * Returns:
> > >   *  > 0 : if already queued.
> > > @@ -8210,11 +8376,14 @@ static int deferred_request(struct perf_event *event)
> > >          int pending;
> > >          int ret;
> > >
> > > -        /* Only defer for task events */
> > > -        if (!event->ctx->task)
> > > +        if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > >                  return -EINVAL;
> > >
> > > -        if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > > +        if (event->unwind_deferred)
> > > +                return deferred_unwind_request(event->unwind_deferred);
> > > +
> > > +        /* Per CPU events should have had unwind_deferred set! */
> > > +        if (WARN_ON_ONCE(!event->ctx->task))
> > >                  return -EINVAL;
> > >
> > >          if (in_nmi())
> > > @@ -13100,13 +13269,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> > >                  }
> > >          }
> > >
> > > +        /* Setup unwind deferring for per CPU events */
> > > +        if (event->attr.defer_callchain && !task) {
> >
> > As I said it should handle per-task and per-CPU events. How about this?
>
> Hmm, I just added some printk()s in this code, and it seems that perf
> record always uses per CPU events.
Right, that's the default behavior.
>
> But if an event is per CPU and per task, will it still only trace that
> task? It will never trace another task, right?
Yes, the event can be inherited by a child, but then the child will create
a new event, so each task will have its own events.
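
For reference, these are the standard perf_event_open() pid/cpu
combinations at play here (man page semantics, nothing new):

        /*
         * perf_event_open(attr, pid, cpu, group_fd, flags):
         *   pid == -1, cpu >= 0  : all tasks, but only on that CPU
         *   pid >= 0,  cpu >= 0  : that task, only while on that CPU
         *   pid >= 0,  cpu == -1 : that task, on any CPU
         */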
>
> Because of the way this is currently implemented, the event that
> requested the callback is the one that records it, even if the
> callback runs on another CPU:
>
> In deferred_request_nmi():
>
>         struct callback_head *work = &event->pending_unwind_work;
>         int ret;
>
>         if (event->pending_unwind_callback)
>                 return 1;
>
>         ret = task_work_add(current, work, TWA_NMI_CURRENT);
>         if (ret)
>                 return ret;
>
>         event->pending_unwind_callback = 1;
>
> The task_work_add() call queues the work embedded in the event's
> pending_unwind_work.
>
> Now the callback will be:
>
> static void perf_event_deferred_task(struct callback_head *work)
> {
>         struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
>
>         // The above is the event that requested this. This may run on another CPU.
>
>         struct unwind_stacktrace trace;
>
>         if (!event->pending_unwind_callback)
>                 return;
>
>         if (unwind_deferred_trace(&trace) >= 0) {
>
>                 /*
>                  * All accesses to the event must belong to the same implicit RCU
>                  * read-side critical section as the ->pending_unwind_callback reset.
>                  * See comment in perf_pending_unwind_sync().
>                  */
>                 guard(rcu)();
>                 perf_event_callchain_deferred(event, &trace);
>
>                 // The above records the stack trace to that event.
>                 // Again, this may happen on another CPU.
>
>         }
>
>         event->pending_unwind_callback = 0;
>         local_dec(&event->ctx->nr_no_switch_fast);
>         rcuwait_wake_up(&event->pending_unwind_wait);
> }
>
> Is recording to an event from a different CPU an issue, if that
> event is also only tracing a task?
IIUC it should be fine as long as you use the unwind descriptor logic
as in the per-CPU case. The data should be written to the current
CPU's ring buffer for both per-task and per-CPU events.
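
To illustrate, emitting the deferred record goes through the usual
output path, which reserves space in the buffer backing the event at
write time. A rough sketch (the record type and size layout follow
this series as I understand it, so treat those details as assumptions):

        struct perf_output_handle handle;
        struct perf_sample_data data;
        struct perf_event_header header = {
                .type = PERF_RECORD_CALLCHAIN_DEFERRED,
                /* header + cookie + nr + entries (layout assumed) */
                .size = sizeof(header) + (trace->nr + 2) * sizeof(u64),
        };

        if (perf_output_begin(&handle, &data, event, header.size))
                return;
        perf_output_put(&handle, header);
        /* ... write the cookie and trace->entries here ... */
        perf_output_end(&handle);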
>
> >
> >         if (event->attr.defer_callchain) {
> >                 if (event->cpu >= 0) {
> >                         err = perf_add_unwind_deferred(event);
> >                         if (err)
> >                                 return ERR_PTR(err);
> >                 } else {
> >                         init_task_work(&event->pending_unwind_work,
> >                                        perf_event_deferred_task);
> >                 }
> >         }
> >
> > > +                err = perf_add_unwind_deferred(event);
> > > +                if (err)
> > > +                        return ERR_PTR(err);
> > > +        }
> > > +
> > >          err = security_perf_event_alloc(event);
> > >          if (err)
> > >                  return ERR_PTR(err);
> > >
> > >          if (event->attr.defer_callchain)
> > >                  init_task_work(&event->pending_unwind_work,
> > > -                               perf_event_callchain_deferred);
> > > +                               perf_event_deferred_task);
> >
> > And you can remove here.
>
> There's nothing wrong with always initializing it. It will just never be
> called.
Ok.
>
> What situation do we have where cpu is negative? What's the perf command?
> Is there one?
Yep, there's the --per-thread option for just per-task events.
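
For example (./myprog is just a placeholder workload):

  $ perf record --per-thread -e cycles -- ./myprog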
Thanks,
Namhyung