linux-kernel - Re: [patch 00/11] rseq: Optimize exit to user space

Message-ID: <87y0rh63t0.ffs@tglx>
Date: Sun, 17 Aug 2025 23:23:07 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Jens Axboe <axboe@...nel.dk>, LKML <linux-kernel@...r.kernel.org>
Cc: Michael Jeanson <mjeanson@...icios.com>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>, Peter Zijlstra <peterz@...radead.org>,
 "Paul E. McKenney" <paulmck@...nel.org>, Boqun Feng
 <boqun.feng@...il.com>, Wei Liu <wei.liu@...nel.org>
Subject: Re: [patch 00/11] rseq: Optimize exit to user space

On Thu, Aug 14 2025 at 00:08, Thomas Gleixner wrote:
> On Wed, Aug 13 2025 at 15:36, Jens Axboe wrote:
>> On 8/13/25 3:32 PM, Thomas Gleixner wrote:
>>> Could you give it a test ride to see whether this makes a difference in
>>> your environment?
>>
>> Yep, I'll give a spin.
>
> Appreciated.

Please do not use the git branch I had in the cover letter. I did some
more analysis of this and it's even worse than I thought. Use

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/wip

instead.

I've rewritten the whole pile by now and made it a real fast path
without the TIF_NOTIFY horror show, unless the fast path, which runs
_after_ the TIF work loop, faults. So far that happens once for each
fork(), as that has to fault in the copy of the user space rseq region.

There are lightweight per CPU stats for the various events, which can be
accessed via

      /sys/kernel/debug/rseq/stat

which exposes a racy sum of them. Here is the output after a reboot and
a full kernel recompile

exit:           85703905    // Total invocations
signal:            34635    // Invocations from signal delivery
slowp:               134    // Slow path via TIF_NOTIFY_RESUME
ids:               70052    // Updates of CPU and MM CID
cs:                    0    // Critical section analysis
clear:                 0    // Clearing of critical section
fixup:                 0    // Fixup of critical section (abort)

Before the rewrite this took more than a million ID updates and
critical section evaluations, even when they were completely pointless.
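
For anyone who wants to compare the counters between runs, here is a
minimal sketch of a parser for the output quoted above. It assumes the
"name: value" layout shown in this mail and treats the trailing "//"
text as annotations to ignore; the actual file format in the branch may
differ. The function names parse_rseq_stat() and share_of_exits() are
made up for illustration and are not part of the kernel or librseq.

```python
import re

# Sample copied from the output quoted in this mail. The "// ..." parts
# are assumed to be annotations and are skipped by the parser.
SAMPLE = """\
exit:           85703905    // Total invocations
signal:            34635    // Invocations from signal delivery
slowp:               134    // Slow path via TIF_NOTIFY_RESUME
ids:               70052    // Updates of CPU and MM CID
cs:                    0    // Critical section analysis
clear:                 0    // Clearing of critical section
fixup:                 0    // Fixup of critical section (abort)
"""

# One counter per line: a name, a colon, then a decimal value.
_LINE = re.compile(r"^\s*(\w+):\s*(\d+)")


def parse_rseq_stat(text):
    """Return a dict mapping counter name to integer value."""
    counters = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def share_of_exits(counters, name):
    """Fraction of exit-to-user invocations that hit the given counter."""
    total = counters.get("exit", 0)
    return counters.get(name, 0) / total if total else 0.0


if __name__ == "__main__":
    stats = parse_rseq_stat(SAMPLE)
    for name in ("signal", "slowp", "ids", "cs", "clear", "fixup"):
        pct = 100 * share_of_exits(stats, name)
        print(f"{name:8s}{stats[name]:>12d}  {pct:.4f}% of exits")
```

To run it against a live system, read /sys/kernel/debug/rseq/stat
instead of SAMPLE. That presumably needs debugfs mounted, sufficient
privileges and a kernel built from the rseq/wip branch. As the sum is
racy, treat the numbers as approximate.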

So on any syscall or interrupt heavy workload this should be clearly
visible as a difference in the profile.

I'm still not happy about the exit to user fast path decision as it's
two conditionals instead of one, but all attempts to do that lightweight
somewhere else turned out to make stuff worse as I just burdened other
fast path operations, i.e. the scheduler, with pointless conditionals.

I'll think about that more, but nevertheless this is way better than the
current horror show.

I also have no real good plan yet how to gradually convert this over,
but I'm way too tired to think about that now.

It survives the self test suite after I wasted a day figuring out why
the selftests reliably segfault on a machine which has Debian trixie
installed. The fix is in the branch.

Michael, can you please run your librseq tests against that too? They
have the same segfault problem as the kernel and they lack a run script,
so I couldn't be bothered to test against them. See commit 2bff3a0e5998
in that branch. I'll send out a patch with a proper change log later.

Thanks,

        tglx