Message-ID: <20241029183440.fbwoistveyxneezt@treble.attlocal.net>
Date: Tue, 29 Oct 2024 11:34:40 -0700
From: Josh Poimboeuf <jpoimboe@...nel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Peter Zijlstra <peterz@...radead.org>, x86@...nel.org,
	Steven Rostedt <rostedt@...dmis.org>,
	Ingo Molnar <mingo@...nel.org>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	linux-kernel@...r.kernel.org, Indu Bhagat <indu.bhagat@...cle.com>,
	Mark Rutland <mark.rutland@....com>,
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
	Jiri Olsa <jolsa@...nel.org>, Namhyung Kim <namhyung@...nel.org>,
	Ian Rogers <irogers@...gle.com>,
	Adrian Hunter <adrian.hunter@...el.com>,
	linux-perf-users@...r.kernel.org, Mark Brown <broonie@...nel.org>,
	linux-toolchains@...r.kernel.org, Jordan Rome <jordalgo@...a.com>,
	Sam James <sam@...too.org>, linux-trace-kernel@...r.kernel.org,
	Andrii Nakryiko <andrii.nakryiko@...il.com>,
	Jens Remus <jremus@...ux.ibm.com>,
	Florian Weimer <fweimer@...hat.com>,
	Andy Lutomirski <luto@...nel.org>
Subject: Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API

On Tue, Oct 29, 2024 at 01:47:59PM -0400, Mathieu Desnoyers wrote:
> > > If ftrace needs brain damage like this, can't we push this to the user?
> > > 
> > > That is, do away with the per-cpu sequence crap, and add a per-task
> > > counter that is incremented for every return-to-userspace.
> > 
> > That would definitely make things easier for me, though IIRC Steven and
> > Mathieu had some concerns about TID wrapping over days/months/years.
> > 
> > With that mindset I suppose the per-CPU counter could also wrap, though
> > that could be mitigated by making the cookie a struct with more bits.
> > 
> 
> AFAIR, the scheme we discussed in Prague was different than the
> implementation here.

It does differ a bit.  I'll explain why below.

> We discussed having a free-running counter per-cpu, and combining it
> with the cpu number as top (or low) bits, to effectively make a 64-bit
> value that is unique across the entire system, but without requiring a
> global counter with its associated cache line bouncing.
> 
> Here is the part where this implementation differs from our discussion:
> I recall we discussed keeping a snapshot of the counter value within
> the task struct of the thread. So we only snapshot the per-cpu value
> on first use after entering the kernel, and after that we use the same
> per-cpu value snapshot (from task struct) up until return to userspace.
> We clear the task struct cookie snapshot on return to userspace.

Right, so adding some details to this, what I remember specifically
deciding on:

 - In unwind_user_deferred(), if the task cookie is 0, it snapshots the
   per-cpu counter, stores that value in the task cookie (with the
   CPU # in the top 48 bits), and increments the per-cpu counter.

 - Future calls to unwind_user_deferred() see the task cookie is set and
   reuse the same cookie.

 - Then the task work (which runs right before exiting the kernel)
   clears the task cookie (with interrupts disabled), does the unwind,
   and calls the callbacks.  It clears the task cookie so that any
   future calls to unwind_user_deferred() (either before exiting the
   kernel or after next entry) are guaranteed to get an unwind (rough
   sketch below).
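
Roughly, in pseudo-kernel-C -- all the names here (unwind_cookie,
unwind_ctx_ctr, get_unwind_cookie) are illustrative, not the actual
patch:

	/* Sketch of the original cookie generation. */
	static u64 get_unwind_cookie(void)
	{
		u64 ctr;

		if (current->unwind_cookie)
			return current->unwind_cookie;

		/*
		 * Snapshot the old per-cpu value, bump the counter, and
		 * pack the CPU # into the high bits (exact bit split is
		 * illustrative).
		 */
		ctr = this_cpu_inc_return(unwind_ctx_ctr) - 1;
		current->unwind_cookie =
			((u64)smp_processor_id() << 48) | ctr;

		return current->unwind_cookie;
	}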

That's what I had implemented originally.  It had a major performance
issue, particularly for long stacks (bash sometimes had 300+ stack
frames in my testing).

The task work runs with interrupts enabled.  So if another PMU interrupt
and call to unwind_user_deferred() happens after the task work clears
the task cookie but before kernel exit, a new cookie is generated and an
additional task work is scheduled.  For long stacks, performance gets
really bad, dominated by all the extra unnecessary unwinding.
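
The window is easiest to see in the task work itself.  A sketch
(do_unwind_and_run_callbacks and the field names are again just
illustration):

	static void unwind_deferred_task_work(struct callback_head *head)
	{
		unsigned long flags;

		local_irq_save(flags);
		current->unwind_cookie = 0;	/* allow future requests */
		local_irq_restore(flags);

		/*
		 * Interrupts are back on.  A PMU interrupt landing here
		 * calls unwind_user_deferred(), which sees a cleared
		 * cookie, generates a new one and queues another task
		 * work -- so the whole (possibly 300+ frame) stack gets
		 * unwound again before we actually return to userspace.
		 */
		do_unwind_and_run_callbacks(current);
	}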

So I changed the design a bit:

  - Increment a per-cpu counter at kernel entry before interrupts are
    enabled.

  - In unwind_user_deferred(), if task cookie is 0, it sets the task
    cookie based on the per-cpu counter, like before.  However if this
    cookie was the last one used by this callback+task, it skips the
    callback altogether.

So it eliminates duplicate unwinds except for the CPU migration case.
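
Sketching the reworked flow, with the same disclaimer about names, and
assuming the entry code does this_cpu_inc(unwind_ctx_ctr) before
enabling interrupts:

	int unwind_user_deferred(struct unwind_work *work)
	{
		u64 cookie = current->unwind_cookie;

		if (!cookie) {
			cookie = ((u64)smp_processor_id() << 48) |
				 this_cpu_read(unwind_ctx_ctr);
			current->unwind_cookie = cookie;
		}

		/*
		 * Same task, same entry context, same callback: the
		 * unwind would be identical, so skip it.  (In practice
		 * last_cookie would be tracked per callback+task;
		 * simplified here.)
		 */
		if (work->last_cookie == cookie)
			return 0;
		work->last_cookie = cookie;

		return schedule_unwind_task_work(work, cookie);
	}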

If I change the entry code to increment a per-task counter instead of a
per-cpu counter, then this problem goes away.  I was just concerned
about the performance impact of doing that on every entry.
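
I.e. something like this in the generic entry path, where unwind_seq
would be a new task_struct field (hypothetical):

	/* On every entry from user space, before interrupts are on. */
	static __always_inline void unwind_enter_from_user_mode(void)
	{
		current->unwind_seq++;
	}

That makes the cookie purely task-local, so migration can't produce a
duplicate, at the cost of an extra task_struct write on every entry.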

-- 
Josh
