linux-kernel - Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface

Message-ID: <20250702132605.6c79c1ec@batman.local.home>
Date: Wed, 2 Jul 2025 13:26:05 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, bpf@...r.kernel.org, x86@...nel.org,
 Masami Hiramatsu <mhiramat@...nel.org>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>, Josh Poimboeuf <jpoimboe@...nel.org>,
 Ingo Molnar <mingo@...nel.org>, Jiri Olsa <jolsa@...nel.org>, Namhyung Kim
 <namhyung@...nel.org>, Thomas Gleixner <tglx@...utronix.de>, Andrii
 Nakryiko <andrii@...nel.org>, Indu Bhagat <indu.bhagat@...cle.com>, "Jose
 E. Marchesi" <jemarch@....org>, Beau Belgrave <beaub@...ux.microsoft.com>,
 Jens Remus <jremus@...ux.ibm.com>, Andrew Morton
 <akpm@...ux-foundation.org>, Jens Axboe <axboe@...nel.dk>, Florian Weimer
 <fweimer@...hat.com>
Subject: Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding
 interface

On Wed, 2 Jul 2025 09:56:39 -0700
Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> Also, does it actually have to be entirely unique? IOW, a 32-bit
> counter (or even less) might be sufficient if there's some guarantee
> that processing happens before the counter wraps around? Again - for
> correlation purposes, just *how* many outstanding events can you have
> that aren't ordered by other things too?
> 
> I'm sure people want to also get some kind of rough time idea, but
> don't most perf events have them simply because people want time
> information for _informational_ reasons, rather than to correlate two
> events?

And it only needs to be unique per thread per system call. The real
reason for this identifier is for lost events. As I explained in the
perf patchset, the issue is this:

In case of dropped events, we could have the case of:

  system_call() {
    <nmi> {
      take kernel stack trace
      ask for deferred trace.

   [EVENTS START DROPPING HERE]
    }
    Call deferred callback to record trace [ BUT IS DROPPED ]
  }

  system_call() {
    <nmi> {
      take kernel stack trace
      ask for deferred trace [ STILL DROPPING ]
    }
    [ READER CATCHES UP AND STARTS READING EVENTS AGAIN]

    Call deferred callback to record trace
  }

The user space tool will see the kernel stack traces of the first
system call, then it will see events dropped, and then it will see the
deferred user space stack trace of the second call.

The identifier is only there for uniqueness for that one thread to let
the tracer know if the deferred trace can be tied to events before it
lost them.
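
A minimal sketch of how a user space tool might apply that check, not
taken from the patch set. The names thread_state, on_kernel_trace() and
on_deferred_trace() are hypothetical, and it assumes each kernel stack
trace event and each deferred user space trace carries the identifier:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread bookkeeping kept by the tracing tool. */
struct thread_state {
	uint64_t pending_cookie;	/* identifier of the last kernel trace seen */
	bool     has_pending;		/* a kernel trace is waiting for its user trace */
};

/* Called when a kernel stack trace event is read for this thread. */
static void on_kernel_trace(struct thread_state *t, uint64_t cookie)
{
	t->pending_cookie = cookie;
	t->has_pending = true;
}

/*
 * Called when a deferred user space stack trace is read for this thread.
 * Returns true only if it belongs to the kernel trace seen earlier; a
 * mismatch means the matching events were lost in between.
 */
static bool on_deferred_trace(struct thread_state *t, uint64_t cookie)
{
	bool match = t->has_pending && t->pending_cookie == cookie;

	t->has_pending = false;
	return match;
}
```

In the dropped-events scenario above, the tool reads the kernel trace of
the first system call, then the deferred trace of the second; the
identifiers differ, so the two are not tied together.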

We figured a single 32 bit counter would be good enough when we first
discussed this idea, but we wanted per cpu counters to not have cache
contention every time a CPU wanted to increment the counter. But each
CPU would need an identifier so that a task migrating will not get the
same identifier for a different system call just because it migrated.

We used 16 bits for the CPU counter thinking that 32K of CPUs would
last some time in the future. We then chose to use a 64 bit number to
allow us to have 48 bits left for uniqueness which is plenty.
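
For illustration only, the 64 bit layout described above could be packed
as follows. This is a sketch under assumptions: the macro and function
names are hypothetical, and placing the CPU in the upper 16 bits is an
arbitrary choice that the description does not specify.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: CPU id in the upper 16 bits, per-CPU counter in the lower 48. */
#define COOKIE64_CPU_BITS	16
#define COOKIE64_CNT_BITS	48
#define COOKIE64_CNT_MASK	((UINT64_C(1) << COOKIE64_CNT_BITS) - 1)

/* Combine a CPU id and that CPU's counter value into one identifier. */
static inline uint64_t make_cookie64(uint32_t cpu, uint64_t counter)
{
	assert(cpu < (UINT32_C(1) << COOKIE64_CPU_BITS));
	return ((uint64_t)cpu << COOKIE64_CNT_BITS) | (counter & COOKIE64_CNT_MASK);
}

/* Extract the CPU id part of an identifier. */
static inline uint32_t cookie64_cpu(uint64_t cookie)
{
	return (uint32_t)(cookie >> COOKIE64_CNT_BITS);
}

/* Extract the counter part of an identifier. */
static inline uint64_t cookie64_count(uint64_t cookie)
{
	return cookie & COOKIE64_CNT_MASK;
}
```

Because the CPU id is part of the value, two CPUs whose counters happen
to be equal still produce different identifiers, which is what keeps a
migrating task from seeing a repeat.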

If we use 32 bits, that would give us 32K of unique system calls, and it
does seem possible that on a busy system, a tracer could lose 32K of
system calls before it gets going again. But we could still use it
anyway, as losing exactly 32K of system calls and then starting tracing
back up again will probably never happen. And if it does, the worst
thing that it will do is have the tracer mistake which user space stack
trace goes to which event. If you are tracing that many events, this
will likely be in the noise.

So I'm fine with making this a 32 bit counter using 16 bits for the CPU
and 16 bits for per thread uniqueness.
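
A rough sketch of what such a 32 bit identifier could look like, again
with hypothetical names and an assumed bit order (CPU in the upper half).
It also shows the only failure mode: once the 16 bit counter field wraps,
an old identifier reappears.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: CPU id in the upper 16 bits, counter in the lower 16. */
#define COOKIE32_CPU_BITS	16
#define COOKIE32_CNT_BITS	16
#define COOKIE32_CNT_MASK	((UINT32_C(1) << COOKIE32_CNT_BITS) - 1)

/*
 * Combine a CPU id and a counter into a 32 bit identifier. The counter is
 * truncated to its field width, so values that differ by a multiple of
 * (1 << COOKIE32_CNT_BITS) map to the same identifier.
 */
static inline uint32_t make_cookie32(uint32_t cpu, uint32_t counter)
{
	assert(cpu < (UINT32_C(1) << COOKIE32_CPU_BITS));
	return (cpu << COOKIE32_CNT_BITS) | (counter & COOKIE32_CNT_MASK);
}
```

A collision therefore requires the counter to advance by an exact
multiple of the field size between the lost event and the next deferred
trace that is read.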

-- Steve