Message-ID: <486e8f79-4836-4971-8876-45aee31ffda9@efficios.com>
Date: Wed, 20 Aug 2025 10:10:30 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Michael Jeanson <mjeanson@...icios.com>,
Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
<paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
Wei Liu <wei.liu@...nel.org>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [patch 00/11] rseq: Optimize exit to user space
On 2025-08-13 12:29, Thomas Gleixner wrote:
> With the more widespread usage of rseq in glibc, rseq is no longer a
> niche use case for special applications.
>
> While working on a sane implementation of a rseq based time slice extension
> mechanism, I noticed several shortcomings of the current rseq code:
>
> 1) task::rseq_event_mask is a pointless bitfield despite the fact that
> the ABI flags it was meant to support have been deprecated and
> functionally disabled three years ago.
Correct. We could replace this mask with a single flag that is set when
an event happens and test-and-cleared on return to userspace. This would
lessen the synchronization requirements because bitops would no longer
be needed.
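To illustrate the single-flag idea, here is a minimal userspace model in C.
It is only a sketch under assumptions: struct task_model and the model_*
helpers are hypothetical names and do not exist in the kernel or in this
series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these names are not from the kernel. */
struct task_model {
	bool rseq_event;	/* single flag replacing the event bitmask */
};

/* An event (preemption, signal delivery, migration) only needs a plain store. */
static inline void model_rseq_set_event(struct task_model *t)
{
	t->rseq_event = true;
}

/* Return to userspace: report whether an event was pending and clear it. */
static inline bool model_rseq_test_and_clear(struct task_model *t)
{
	bool pending = t->rseq_event;

	t->rseq_event = false;
	return pending;
}

/* Usage: several events collapse into one pending flag, consumed once. */
static inline void model_rseq_flag_demo(void)
{
	struct task_model t = { .rseq_event = false };

	model_rseq_set_event(&t);
	model_rseq_set_event(&t);
	assert(model_rseq_test_and_clear(&t));
	assert(!model_rseq_test_and_clear(&t));
}
```

Since the exit path only asks "did anything happen?", one boolean carries
the same information as the three event bits in this model.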
>
> 2) task::rseq_event_mask is accumulating bits unless there is a critical
> section discovered in the user space rseq memory. This results in
> pointless invocations of the rseq user space exit handler even if
> nothing had changed. As a matter of correctness these bits have
> to be clear when exiting to user space and therefore pristine when
> coming back into the kernel. Aside from correctness, this also avoids
> pointless evaluation of the user space memory, which is a performance
> benefit.
Ouch. Yes. The intent here is indeed to clear those bits in all cases,
whether there is an rseq critical section ongoing or not. Good catch!
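The difference between the two behaviours can be modelled in userspace as
follows. This is a sketch under assumptions, not the kernel code: struct
exit_model and both functions are hypothetical, and in_critical_section
merely stands in for finding a non-NULL rseq_cs in user memory.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not kernel code. */
struct exit_model {
	unsigned int event_mask;	/* pending event bits */
	bool in_critical_section;	/* stands in for a non-NULL rseq_cs */
};

/* Behaviour described in 2): bits survive if no critical section is found. */
static inline bool model_exit_clear_only_in_cs(struct exit_model *t)
{
	bool had_events = t->event_mask != 0;

	if (t->in_critical_section)
		t->event_mask = 0;
	return had_events;	/* true: the exit handler had work to look at */
}

/* Intended behaviour: pending bits are always consumed on exit. */
static inline bool model_exit_clear_always(struct exit_model *t)
{
	bool had_events = t->event_mask != 0;

	t->event_mask = 0;
	return had_events;
}

/* Usage: one event, no critical section, two consecutive exits. */
static inline void model_exit_demo(void)
{
	struct exit_model stale = { .event_mask = 1, .in_critical_section = false };
	struct exit_model clean = { .event_mask = 1, .in_critical_section = false };

	assert(model_exit_clear_only_in_cs(&stale));
	assert(model_exit_clear_only_in_cs(&stale));	/* stale bit triggers again */
	assert(model_exit_clear_always(&clean));
	assert(!model_exit_clear_always(&clean));	/* nothing pending anymore */
}
```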
>
> 3) The evaluation of critical sections does not differentiate between
> syscall and interrupt/exception exits. The current implementation
> silently fixes up critical sections which invoked a syscall unless
> CONFIG_DEBUG_RSEQ is enabled.
>
> That's just wrong. If user space does that on a production kernel it
> can keep the pieces. The kernel is not there to proliferate mindless
> user space programming and let everyone pay the performance
> penalty.
Agreed. There is no need to fix up the rseq critical section; however, we
still need to update other rseq fields (e.g. current CPU number) when
returning from a system call.
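Roughly, the split I have in mind looks like the userspace model below. It
is a sketch under assumptions: enum exit_kind, struct rseq_model and
model_exit_to_user() are hypothetical names, the struct is a simplified
stand-in for struct rseq, and clearing rseq_cs stands in for the real abort
handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not kernel code. */
enum exit_kind { EXIT_SYSCALL, EXIT_INTERRUPT };

struct rseq_model {
	uint32_t cpu_id;	/* CPU number visible to user space */
	uint64_t rseq_cs;	/* non-zero: a critical section is installed */
};

struct exit_result {
	bool cpu_id_updated;
	bool cs_aborted;
};

static inline struct exit_result
model_exit_to_user(struct rseq_model *r, enum exit_kind kind, uint32_t cur_cpu)
{
	struct exit_result res = { false, false };

	/* Always needed, syscall or not: keep the CPU number current. */
	if (r->cpu_id != cur_cpu) {
		r->cpu_id = cur_cpu;
		res.cpu_id_updated = true;
	}

	/* Only interrupt/exception exits look at the critical section. */
	if (kind == EXIT_INTERRUPT && r->rseq_cs != 0) {
		r->rseq_cs = 0;
		res.cs_aborted = true;
	}
	return res;
}
```

In this model a syscall issued from inside a critical section leaves
rseq_cs untouched, while a migration noticed on syscall return still
refreshes cpu_id.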
>
> This series addresses these issues and on top converts parts of the user
> space access over to the new masked access model, which lowers the overhead
> of Spectre-V1 mitigations significantly on architectures which support it
> (x86 as of today). This is especially noticeable in the access to the
> rseq_cs field in struct rseq, which is the first quick check to figure out
> whether a critical section is installed or not.
Nice!
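For readers not familiar with masked access, my understanding of the idea is
that the user pointer is clamped to a known limit rather than validated with
a range check followed by a speculation barrier, so that an out-of-range
address can only end up at a location that faults. A minimal userspace
sketch, assuming a made-up limit (MODEL_USER_PTR_MAX and
model_mask_user_address() are hypothetical names, and the real bound is
architecture specific):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limit for this model only. */
#define MODEL_USER_PTR_MAX UINT64_C(0x00007fffffffffff)

/* Clamp without control flow in the source: bad addresses collapse onto the limit. */
static inline uint64_t model_mask_user_address(uint64_t addr)
{
	/* All ones if addr is out of range, zero otherwise. */
	uint64_t over = -(uint64_t)(addr > MODEL_USER_PTR_MAX);

	return (addr & ~over) | (MODEL_USER_PTR_MAX & over);
}

/* Usage: in-range addresses pass through, out-of-range ones are clamped. */
static inline void model_mask_demo(void)
{
	assert(model_mask_user_address(UINT64_C(0x1000)) == UINT64_C(0x1000));
	assert(model_mask_user_address(UINT64_C(0xffff800000000000)) ==
	       MODEL_USER_PTR_MAX);
}
```

Whether a compiler emits branch-free code for this is not guaranteed; the
model only shows the clamping semantics.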
I'll try to have a closer look at the series soon.
Thanks,
Mathieu
>
> It survives the kernel's rseq selftests, but I did not run any performance tests
> vs. rseq because I have no idea how to use the gazillion of undocumented
> command line parameters of the benchmark. I leave that to people who are so
> familiar with them, that they assume everyone else is too :)
>
> The performance gain on regular workloads is clearly measurable and the
> consistent event flag state now allows building the time slice extension
> mechanism on top. The first POC I implemented:
>
> https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/
>
> suffered badly from the stale eventmask bits and the cleaned up version
> brought a whopping 25+% performance gain.
>
> The series depends on the separately posted rseq bugfix:
>
> https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/
>
> and the uaccess generic helper series:
>
> https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/
>
> and a related futex fix in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent
>
> For your convenience it is also available as a conglomerate from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
>
> Thanks,
>
> tglx
> ---
>  drivers/hv/mshv_common.c         |   2
>  include/linux/entry-common.h     |  12 +--
>  include/linux/irq-entry-common.h |  31 ++++++--
>  include/linux/resume_user_mode.h |  40 +++++++---
>  include/linux/rseq.h             | 115 +++++++++----------------------
>  include/linux/sched.h            |  10 +-
>  include/uapi/linux/rseq.h        |  21 +----
>  io_uring/io_uring.h              |   2
>  kernel/entry/common.c            |   9 +-
>  kernel/entry/kvm.c               |   2
>  kernel/rseq.c                    | 142 +++++++++++++++++++++++++--------------
>  kernel/sched/core.c              |   8 +-
>  kernel/sched/membarrier.c        |   8 +-
>  13 files changed, 213 insertions(+), 189 deletions(-)
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com