Message-ID: <89c62296-fbe4-4d9d-a2ec-19c4ca0c14b2@efficios.com>
Date: Thu, 8 May 2025 14:49:59 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Steven Rostedt <rostedt@...dmis.org>, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, Namhyung Kim <namhyung@...nel.org>
Cc: Masami Hiramatsu <mhiramat@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Josh Poimboeuf <jpoimboe@...nel.org>, x86@...nel.org,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...nel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Indu Bhagat <indu.bhagat@...cle.com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>, linux-perf-users@...r.kernel.org,
Mark Brown <broonie@...nel.org>, linux-toolchains@...r.kernel.org,
Jordan Rome <jordalgo@...a.com>, Sam James <sam@...too.org>,
Andrii Nakryiko <andrii.nakryiko@...il.com>,
Jens Remus <jremus@...ux.ibm.com>, Florian Weimer <fweimer@...hat.com>,
Andy Lutomirski <luto@...nel.org>, Weinan Liu <wnliu@...gle.com>,
Blake Jones <blakejones@...gle.com>,
Beau Belgrave <beaub@...ux.microsoft.com>, "Jose E. Marchesi"
<jemarch@....org>
Subject: Re: [PATCH v5 13/17] perf: Support deferred user callchains
On 2025-05-08 12:03, Steven Rostedt wrote:
> On Thu, 24 Apr 2025 12:25:42 -0400
> Steven Rostedt <rostedt@...dmis.org> wrote:
>
>> +static void perf_event_callchain_deferred(struct callback_head *work)
>> +{
>> +	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
>> +	struct perf_callchain_deferred_event deferred_event;
>> +	u64 callchain_context = PERF_CONTEXT_USER;
>> +	struct unwind_stacktrace trace;
>> +	struct perf_output_handle handle;
>> +	struct perf_sample_data data;
>> +	u64 nr;
>> +
>> +	if (!event->pending_unwind_callback)
>> +		return;
>> +
>> +	if (unwind_deferred_trace(&trace) < 0)
>> +		goto out;
>> +
>> +	/*
>> +	 * All accesses to the event must belong to the same implicit RCU
>> +	 * read-side critical section as the ->pending_unwind_callback reset.
>> +	 * See comment in perf_pending_unwind_sync().
>> +	 */
>> +	guard(rcu)();
>> +
>> +	if (!current->mm)
>> +		goto out;
>> +
>> +	nr = trace.nr + 1; /* '+1' == callchain_context */
>
> Hi Namhyung,
>
> Talking with Beau about how Microsoft does their own deferred tracing, I
> wonder if the timestamp approach would be useful.
>
> This is where a timestamp is taken at the first request for a deferred
> trace, and this is recorded in the trace when it happens. It basically
> states that "this trace is good up until the given timestamp".
>
> The rationale for this is for lost events. Let's say you have:
>
> <task enters kernel>
> Request deferred trace
>
> <buffer fills up and events start to get lost>
>
> Deferred trace happens (but is dropped due to buffer being full)
>
> <task exits kernel>
>
> <task enters kernel again>
> Request deferred trace (Still dropped due to buffer being full)
>
> <Reader catches up and buffer is free again>
>
> Deferred trace happens (this time it is recorded)
> <task exits kernel>
>
> How would user space know that the deferred trace that was recorded doesn't
> go with the request (and kernel stack trace) that was done initially?
>
> If we add a timestamp, then it would look like:
>
> <task enters kernel>
> Request deferred trace
> [Record timestamp]
>
> <buffer fills up and events start to get lost>
>
> Deferred trace happens with timestamp (but is dropped due to buffer being full)
>
> <task exits kernel>
>
> <task enters kernel again>
> Request deferred trace (Still dropped due to buffer being full)
> [Record timestamp]
>
> <Reader catches up and buffer is free again>
>
> Deferred trace happens with timestamp (this time it is recorded)
> <task exits kernel>
>
> Then user space can look at the recorded timestamp and know that it does
> not go with the initial request: the kernel stack trace was taken before
> the timestamp of the user space stacktrace, so the user stacktrace is not
> valid for that kernel stacktrace.
>
> The timestamp would become zero when exiting to user space. The first
> request would record it, which needs a cmpxchg. If the cmpxchg fails, the
> requester checks whether the recorded timestamp is before the current one,
> and if it is not, it still updates the timestamp (this handles races with
> NMIs).
>
> Basically, the timestamp would replace the cookie method.
>
> Thoughts?
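If it helps to make the proposal concrete, here is a rough userspace model
of the cmpxchg scheme described above. All names are hypothetical and this
is not code from the patch; I am also assuming that on a race the earliest
timestamp is the one that should be kept, per the NMI-race handling
described:

```c
/* Userspace model of the "record a timestamp on the first deferred
 * request" scheme. Hypothetical names; not code from the patch.
 * Assumption: on a cmpxchg race the earliest timestamp wins. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t deferred_timestamp; /* 0 == no pending request */

/* Called on each deferred-trace request; returns the timestamp that
 * ends up recorded for the eventual user space stacktrace. */
static uint64_t request_deferred_trace(uint64_t now)
{
	uint64_t old = 0;

	/* First request since the last return to user space? */
	if (atomic_compare_exchange_strong(&deferred_timestamp, &old, now))
		return now;

	/* Raced with another requester (e.g. an NMI): if the recorded
	 * timestamp is not before ours, replace it with ours. */
	while (old > now &&
	       !atomic_compare_exchange_weak(&deferred_timestamp, &old, now))
		;
	return atomic_load(&deferred_timestamp);
}

/* The timestamp becomes zero when exiting to user space. */
static void clear_deferred_timestamp(void)
{
	atomic_store(&deferred_timestamp, 0);
}
```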
AFAIR, the cookie method generates the cookie by combining the cpu
number with a per-cpu count.
This ensures that two CPUs cannot accidentally emit cookies with the
same value at the same time.
How would the timestamp method prevent this?
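For reference, the cookie generation I have in mind amounts to roughly
this (a hypothetical userspace model with made-up names, not the actual
kernel code):

```c
/* Userspace model of the cookie scheme: CPU number in the high bits,
 * a per-CPU counter in the low bits, so cookies generated concurrently
 * on two CPUs can never collide. Hypothetical names, demo-sized array. */
#include <assert.h>
#include <stdint.h>

#define COOKIE_CPU_SHIFT 48

static uint64_t percpu_count[256]; /* stand-in for a real per-CPU var */

static uint64_t make_cookie(unsigned int cpu)
{
	/* The counter is per-CPU, so the real thing would only need to
	 * prevent migration around the increment, not full atomics. */
	return ((uint64_t)cpu << COOKIE_CPU_SHIFT) | ++percpu_count[cpu];
}
```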
Thanks,
Mathieu
>
> -- Steve
>
>
>> +
>> +	deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
>> +	deferred_event.header.misc = PERF_RECORD_MISC_USER;
>> +	deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
>> +
>> +	deferred_event.nr = nr;
>> +
>> +	perf_event_header__init_id(&deferred_event.header, &data, event);
>> +
>> +	if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
>> +		goto out;
>> +
>> +	perf_output_put(&handle, deferred_event);
>> +	perf_output_put(&handle, callchain_context);
>> +	perf_output_copy(&handle, trace.entries, trace.nr * sizeof(u64));
>> +	perf_event__output_id_sample(event, &handle, &data);
>> +
>> +	perf_output_end(&handle);
>> +
>> +out:
>> +	event->pending_unwind_callback = 0;
>> +	local_dec(&event->ctx->nr_no_switch_fast);
>> +	rcuwait_wake_up(&event->pending_unwind_wait);
>> +}
>> +
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com