Message-ID: <20250508120321.20677bc6@gandalf.local.home>
Date: Thu, 8 May 2025 12:03:21 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
 Namhyung Kim <namhyung@...nel.org>
Cc: Masami Hiramatsu <mhiramat@...nel.org>, Mark Rutland
 <mark.rutland@....com>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
 Andrew Morton <akpm@...ux-foundation.org>, Josh Poimboeuf
 <jpoimboe@...nel.org>, x86@...nel.org, Peter Zijlstra
 <peterz@...radead.org>, Ingo Molnar <mingo@...nel.org>, Arnaldo Carvalho de
 Melo <acme@...nel.org>, Indu Bhagat <indu.bhagat@...cle.com>, Alexander
 Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa
 <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>, Adrian Hunter
 <adrian.hunter@...el.com>, linux-perf-users@...r.kernel.org, Mark Brown
 <broonie@...nel.org>, linux-toolchains@...r.kernel.org, Jordan Rome
 <jordalgo@...a.com>, Sam James <sam@...too.org>, Andrii Nakryiko
 <andrii.nakryiko@...il.com>, Jens Remus <jremus@...ux.ibm.com>, Florian
 Weimer <fweimer@...hat.com>, Andy Lutomirski <luto@...nel.org>, Weinan Liu
 <wnliu@...gle.com>, Blake Jones <blakejones@...gle.com>, Beau Belgrave
 <beaub@...ux.microsoft.com>, "Jose E. Marchesi" <jemarch@....org>
Subject: Re: [PATCH v5 13/17] perf: Support deferred user callchains

On Thu, 24 Apr 2025 12:25:42 -0400
Steven Rostedt <rostedt@...dmis.org> wrote:

> +static void perf_event_callchain_deferred(struct callback_head *work)
> +{
> +	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> +	struct perf_callchain_deferred_event deferred_event;
> +	u64 callchain_context = PERF_CONTEXT_USER;
> +	struct unwind_stacktrace trace;
> +	struct perf_output_handle handle;
> +	struct perf_sample_data data;
> +	u64 nr;
> +
> +	if (!event->pending_unwind_callback)
> +		return;
> +
> +	if (unwind_deferred_trace(&trace) < 0)
> +		goto out;
> +
> +	/*
> +	 * All accesses to the event must belong to the same implicit RCU
> +	 * read-side critical section as the ->pending_unwind_callback reset.
> +	 * See comment in perf_pending_unwind_sync().
> +	 */
> +	guard(rcu)();
> +
> +	if (!current->mm)
> +		goto out;
> +
> +	nr = trace.nr + 1 ; /* '+1' == callchain_context */

Hi Namhyung,

Talking with Beau about how Microsoft does their own deferred tracing, I
wonder if the timestamp approach would be useful.

This is where a timestamp is taken at the first request for a deferred
trace and then recorded in the deferred trace event when it is emitted. It
basically states "this trace is good up until the given timestamp".

The rationale for this is for lost events. Let's say you have:

  <task enters kernel>
    Request deferred trace

    <buffer fills up and events start to get lost>

    Deferred trace happens (but is dropped due to buffer being full)

  <task exits kernel>

  <task enters kernel again>
    Request deferred trace  (Still dropped due to buffer being full)

    <Reader catches up and buffer is free again>

    Deferred trace happens (this time it is recorded)
  <task exits kernel>

How would user space know that the deferred trace that was recorded doesn't
go with the request (and kernel stack trace) that was made initially?

If we add a timestamp, then it would look like:

  <task enters kernel>
    Request deferred trace
    [Record timestamp]

    <buffer fills up and events start to get lost>

    Deferred trace happens with timestamp (but is dropped due to buffer being full)

  <task exits kernel>

  <task enters kernel again>
    Request deferred trace  (Still dropped due to buffer being full)
    [Record timestamp]

    <Reader catches up and buffer is free again>

    Deferred trace happens with timestamp (this time it is recorded)
  <task exits kernel>

Then user space will look at the timestamp that was recorded and know that
the deferred trace is not for the initial request: the kernel stack trace's
timestamp is earlier than the timestamp carried by the user space
stacktrace, so that user trace is not valid for the first kernel stacktrace.
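The matching rule that falls out of this on the user space side could be
sketched roughly like below. This is just an illustration; the function
name, and the assumption that the deferred event carries the request
timestamp as a field, are mine, not part of the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space check: a deferred user stacktrace that
 * carries the timestamp of the request that produced it is only
 * valid for kernel samples taken at or after that request.  A
 * kernel sample with an older timestamp belongs to an earlier
 * request whose deferred trace was lost.
 */
static bool deferred_trace_matches(uint64_t kernel_sample_ts,
				   uint64_t deferred_request_ts)
{
	return kernel_sample_ts >= deferred_request_ts;
}
```

In the lost-events scenario above, the first kernel sample's timestamp
precedes the timestamp recorded in the surviving deferred event, so the
check rejects that pairing.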

The timestamp would become zero when exiting to user space. The first
request would set it, but needs a cmpxchg to do so. If the cmpxchg fails,
the requester then checks whether the timestamp already recorded is before
its own; if it isn't, it still needs to update the timestamp (this is to
handle races with NMIs).
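As a rough user-space model of that update (the field and function names
are made up for illustration; the kernel version would live in the task
struct and use the kernel's cmpxchg helpers rather than C11 atomics):

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Illustrative stand-in for a per-task field, zeroed on exit to
 * user space.  0 means "no deferred request yet this kernel entry".
 */
static _Atomic uint64_t unwind_request_ts;

/*
 * Record the timestamp of the first request in this kernel entry.
 * A cmpxchg from 0 installs our timestamp.  If that fails, someone
 * (possibly an NMI) raced with us; keep the earlier of the two
 * timestamps, retrying while the stored one is newer than ours.
 */
static uint64_t unwind_record_timestamp(uint64_t now)
{
	uint64_t old = 0;

	if (atomic_compare_exchange_strong(&unwind_request_ts, &old, now))
		return now;	/* we were the first request */

	/* 'old' holds the stored value; compare_exchange reloads it
	 * into 'old' on every failed attempt. */
	while (old > now &&
	       !atomic_compare_exchange_weak(&unwind_request_ts, &old, now))
		;

	return old < now ? old : now;
}
```

The earlier timestamp always wins, so a sample taken between the first
request and kernel exit still matches the deferred trace that is
eventually emitted.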

Basically, the timestamp would replace the cookie method.

Thoughts?

-- Steve


> +
> +	deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> +	deferred_event.header.misc = PERF_RECORD_MISC_USER;
> +	deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
> +
> +	deferred_event.nr = nr;
> +
> +	perf_event_header__init_id(&deferred_event.header, &data, event);
> +
> +	if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
> +		goto out;
> +
> +	perf_output_put(&handle, deferred_event);
> +	perf_output_put(&handle, callchain_context);
> +	perf_output_copy(&handle, trace.entries, trace.nr * sizeof(u64));
> +	perf_event__output_id_sample(event, &handle, &data);
> +
> +	perf_output_end(&handle);
> +
> +out:
> +	event->pending_unwind_callback = 0;
> +	local_dec(&event->ctx->nr_no_switch_fast);
> +	rcuwait_wake_up(&event->pending_unwind_wait);
> +}
> +
