Message-ID: <aBz7mvEQwtlgNUjI@google.com>
Date: Thu, 8 May 2025 11:44:42 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
	Masami Hiramatsu <mhiramat@...nel.org>,
	Mark Rutland <mark.rutland@....com>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Josh Poimboeuf <jpoimboe@...nel.org>, x86@...nel.org,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Indu Bhagat <indu.bhagat@...cle.com>,
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
	Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
	Adrian Hunter <adrian.hunter@...el.com>,
	linux-perf-users@...r.kernel.org, Mark Brown <broonie@...nel.org>,
	linux-toolchains@...r.kernel.org, Jordan Rome <jordalgo@...a.com>,
	Sam James <sam@...too.org>,
	Andrii Nakryiko <andrii.nakryiko@...il.com>,
	Jens Remus <jremus@...ux.ibm.com>,
	Florian Weimer <fweimer@...hat.com>,
	Andy Lutomirski <luto@...nel.org>, Weinan Liu <wnliu@...gle.com>,
	Blake Jones <blakejones@...gle.com>,
	Beau Belgrave <beaub@...ux.microsoft.com>,
	"Jose E. Marchesi" <jemarch@....org>
Subject: Re: [PATCH v5 13/17] perf: Support deferred user callchains

Hi Steve,

On Thu, May 08, 2025 at 12:03:21PM -0400, Steven Rostedt wrote:
> On Thu, 24 Apr 2025 12:25:42 -0400
> Steven Rostedt <rostedt@...dmis.org> wrote:
> 
> > +static void perf_event_callchain_deferred(struct callback_head *work)
> > +{
> > +	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> > +	struct perf_callchain_deferred_event deferred_event;
> > +	u64 callchain_context = PERF_CONTEXT_USER;
> > +	struct unwind_stacktrace trace;
> > +	struct perf_output_handle handle;
> > +	struct perf_sample_data data;
> > +	u64 nr;
> > +
> > +	if (!event->pending_unwind_callback)
> > +		return;
> > +
> > +	if (unwind_deferred_trace(&trace) < 0)
> > +		goto out;
> > +
> > +	/*
> > +	 * All accesses to the event must belong to the same implicit RCU
> > +	 * read-side critical section as the ->pending_unwind_callback reset.
> > +	 * See comment in perf_pending_unwind_sync().
> > +	 */
> > +	guard(rcu)();
> > +
> > +	if (!current->mm)
> > +		goto out;
> > +
> > +	nr = trace.nr + 1; /* '+1' == callchain_context */
> 
> Hi Namhyung,
> 
> Talking with Beau about how Microsoft does their own deferred tracing, I
> wonder if the timestamp approach would be useful.
> 
> This is where a timestamp is taken at the first request for a deferred
> trace, and this is recorded in the trace when it happens. It basically
> states that "this trace is good up until the given timestamp".
> 
> The rationale for this is for lost events. Let's say you have:
> 
>   <task enters kernel>
>     Request deferred trace
> 
>     <buffer fills up and events start to get lost>
> 
>     Deferred trace happens (but is dropped due to buffer being full)
> 
>   <task exits kernel>
> 
>   <task enters kernel again>
>     Request deferred trace  (Still dropped due to buffer being full)
> 
>     <Reader catches up and buffer is free again>
> 
>     Deferred trace happens (this time it is recorded)
>   <task exits kernel>
> 
> How would user space know that the deferred trace that was recorded doesn't
> go with the request (and kernel stack trace) that was done initially?

Right, this is a problem.

> 
> If we add a timestamp, then it would look like:
> 
>   <task enters kernel>
>     Request deferred trace
>     [Record timestamp]
> 
>     <buffer fills up and events start to get lost>
> 
>     Deferred trace happens with timestamp (but is dropped due to buffer being full)
> 
>   <task exits kernel>
> 
>   <task enters kernel again>
>     Request deferred trace  (Still dropped due to buffer being full)
>     [Record timestamp]
> 
>     <Reader catches up and buffer is free again>
> 
>     Deferred trace happens with timestamp (this time it is recorded)
>   <task exits kernel>
> 
> Then user space will look at the timestamp that was recorded and know that
> it's not for the initial request: the timestamp of the kernel stack trace
> was before the timestamp of the user space stacktrace, so the user space
> stacktrace is not valid for that kernel stacktrace.

IIUC the deferred stacktrace will have the timestamp of the first
request, right?
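
If so, the matching on the tool side could be as simple as a time
comparison, something like this (just a sketch, the names are made up):

/*
 * Sketch only: pair a kernel sample with a deferred user callchain from
 * the same thread.  A sample taken before the recorded request timestamp
 * belongs to an earlier kernel entry whose deferred trace was lost, so
 * it must not be paired with this one.
 */
static bool deferred_callchain_matches(u64 sample_time, u64 request_time)
{
	return sample_time >= request_time;
}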

> 
> The timestamp would become zero when exiting to user space. The first
> request will add it but would need a cmpxchg to do so, and if the cmpxchg
> fails, it then needs to check if the one recorded is before the current
> one, and if it isn't it still needs to update the timestamp (this is to
> handle races with NMIs).

Yep, it needs to maintain an accurate first timestamp.
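
Just to check my understanding, the update would look roughly like this?
(The per-task field and helper names below are placeholders, not what
your patch uses.)

/*
 * Assumes a per-task u64 (called unwind_deferred_timestamp here, name
 * made up) that is cleared to 0 when returning to user space.
 */
static u64 unwind_deferred_record_timestamp(struct task_struct *task)
{
	u64 now = local_clock();
	u64 old = 0;

	/* The first request installs its timestamp (zero means "unset"). */
	if (try_cmpxchg64(&task->unwind_deferred_timestamp, &old, now))
		return now;

	/*
	 * Lost the race (e.g. against an NMI).  Keep the earliest
	 * timestamp: only replace the recorded one if it is newer
	 * than ours.
	 */
	while (old > now) {
		if (try_cmpxchg64(&task->unwind_deferred_timestamp, &old, now))
			return now;
	}

	return old;
}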

> 
> Basically, the timestamp would replace the cookie method.
> 
> Thoughts?

Sounds good to me.  You'll need to add it to the
PERF_RECORD_CALLCHAIN_DEFERRED.  Probably it should check if sample_type
has PERF_SAMPLE_TIME.  It'd work along with PERF_SAMPLE_TID (which will
be added by the perf tools anyway).
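
FWIW, for the record layout I was imagining something along these lines
(illustration only, the field name and its position are made up):

struct perf_callchain_deferred_event {
	struct perf_event_header	header;
	__u64				req_timestamp;	/* time of the first deferred request */
	__u64				nr;
	__u64				ips[];		/* callchain_context + user frames */
};

Then the tools would only attach the deferred callchain to samples whose
PERF_SAMPLE_TIME is not before req_timestamp.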
 
Thanks,
Namhyung

> 
> > +
> > +	deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> > +	deferred_event.header.misc = PERF_RECORD_MISC_USER;
> > +	deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
> > +
> > +	deferred_event.nr = nr;
> > +
> > +	perf_event_header__init_id(&deferred_event.header, &data, event);
> > +
> > +	if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
> > +		goto out;
> > +
> > +	perf_output_put(&handle, deferred_event);
> > +	perf_output_put(&handle, callchain_context);
> > +	perf_output_copy(&handle, trace.entries, trace.nr * sizeof(u64));
> > +	perf_event__output_id_sample(event, &handle, &data);
> > +
> > +	perf_output_end(&handle);
> > +
> > +out:
> > +	event->pending_unwind_callback = 0;
> > +	local_dec(&event->ctx->nr_no_switch_fast);
> > +	rcuwait_wake_up(&event->pending_unwind_wait);
> > +}
> > +
