Message-ID: <486e8f79-4836-4971-8876-45aee31ffda9@efficios.com>
Date: Wed, 20 Aug 2025 10:10:30 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Michael Jeanson <mjeanson@...icios.com>,
 Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
 <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
 Wei Liu <wei.liu@...nel.org>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [patch 00/11] rseq: Optimize exit to user space

On 2025-08-13 12:29, Thomas Gleixner wrote:
> With the more widespread usage of rseq in glibc, rseq is no longer a
> niche use case for special applications.
> 
> While working on a sane implementation of a rseq based time slice extension
> mechanism, I noticed several shortcomings of the current rseq code:
> 
>    1) task::rseq_event_mask is a pointless bitfield despite the fact that
>       the ABI flags it was meant to support have been deprecated and
>       functionally disabled three years ago.

Correct. We could replace this mask with a single flag: set when the
events happen, tested and cleared on return to userspace. This would
lessen the synchronization requirements because bitops would no longer
be needed.

> 
>    2) task::rseq_event_mask is accumulating bits unless there is a critical
>       section discovered in the user space rseq memory. This results in
>       pointless invocations of the rseq user space exit handler even if
>       nothing had changed. As a matter of correctness these bits have to
>       be clear when exiting to user space and therefore pristine when
>       coming back into the kernel. Aside from correctness, this also avoids
>       pointless evaluation of the user space memory, which is a performance
>       benefit.

Ouch. Yes. The intent here is indeed to clear those bits in all cases,
whether there is an rseq critical section ongoing or not. Good catch!

> 
>    3) The evaluation of critical sections does not differentiate between
>       syscall and interrupt/exception exits. The current implementation
>       silently fixes up critical sections which invoked a syscall unless
>       CONFIG_DEBUG_RSEQ is enabled.
> 
>       That's just wrong. If user space does that on a production kernel it
>       can keep the pieces. The kernel is not there to proliferate mindless
>       user space programming and to let everyone pay the performance
>       penalty.

Agreed. There is no need to fix up the rseq critical section; however,
we still need to update other rseq fields (e.g. the current CPU number)
when returning from a system call.

> 
> This series addresses these issues and on top converts parts of the user
> space access over to the new masked access model, which lowers the overhead
> of Spectre-V1 mitigations significantly on architectures which support it
> (x86 as of today). This is especially noticeable in the access to the
> rseq_cs field in struct rseq, which is the first quick check to figure out
> whether a critical section is installed or not.

Nice!

I'll try to have a closer look at the series soon.

Thanks,

Mathieu

> 
> It survives the kernel's rseq selftests, but I did not run any performance
> tests vs. rseq because I have no idea how to use the gazillion of undocumented
> command line parameters of the benchmark. I leave that to people who are so
> familiar with them, that they assume everyone else is too :)
> 
> The performance gain on regular workloads is clearly measurable, and the
> consistent event flag state now allows building the time slice extension
> mechanism on top. The first POC I implemented:
> 
>     https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/
> 
> suffered badly from the stale event mask bits, and the cleaned-up version
> brought a whopping 25+% performance gain.
> 
> The series depends on the separately posted rseq bugfix:
> 
>     https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/
> 
> and the uaccess generic helper series:
> 
>     https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/
> 
> and a related futex fix in
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent
> 
> For your convenience it is also available as a conglomerate from git:
> 
>      git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
> 
> Thanks,
> 
> 	tglx
> ---
>   drivers/hv/mshv_common.c         |    2
>   include/linux/entry-common.h     |   12 +--
>   include/linux/irq-entry-common.h |   31 ++++++--
>   include/linux/resume_user_mode.h |   40 +++++++---
>   include/linux/rseq.h             |  115 +++++++++----------------------
>   include/linux/sched.h            |   10 +-
>   include/uapi/linux/rseq.h        |   21 +----
>   io_uring/io_uring.h              |    2
>   kernel/entry/common.c            |    9 +-
>   kernel/entry/kvm.c               |    2
>   kernel/rseq.c                    |  142 +++++++++++++++++++++++++--------------
>   kernel/sched/core.c              |    8 +-
>   kernel/sched/membarrier.c        |    8 +-
>   13 files changed, 213 insertions(+), 189 deletions(-)
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
