Message-ID: <20250813155941.014821755@linutronix.de>
Date: Wed, 13 Aug 2025 18:29:12 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: LKML <linux-kernel@...r.kernel.org>
Cc: Michael Jeanson <mjeanson@...icios.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Peter Zijlstra <peterz@...radead.org>,
"Paul E. McKenney" <paulmck@...nel.org>,
Boqun Feng <boqun.feng@...il.com>,
Wei Liu <wei.liu@...nel.org>,
Jens Axboe <axboe@...nel.dk>
Subject: [patch 00/11] rseq: Optimize exit to user space
With the widespread usage of rseq in glibc, rseq is no longer a
niche use case for special applications.
While working on a sane implementation of an rseq-based time slice extension
mechanism, I noticed several shortcomings of the current rseq code:
1) task::rseq_event_mask is a pointless bitfield: the ABI flags it was
meant to support were deprecated and functionally disabled three
years ago.
2) task::rseq_event_mask accumulates bits unless a critical
section is discovered in the user space rseq memory. This results in
pointless invocations of the rseq user space exit handler even if
nothing has changed. As a matter of correctness these bits have
to be clear when exiting to user space and therefore pristine when
coming back into the kernel. Aside from correctness, this also avoids
pointless evaluation of the user space memory, which is a performance
benefit.
3) The evaluation of critical sections does not differentiate between
syscall and interrupt/exception exits. The current implementation
silently fixes up critical sections which invoked a syscall unless
CONFIG_DEBUG_RSEQ is enabled.

That's just wrong. If user space does that on a production kernel it
can keep the pieces. The kernel is not there to proliferate mindless
user space programming and to let everyone pay the performance
penalty. The sketch below illustrates the distinction.
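
To make this concrete, here is a rough C sketch of that distinction. It is
not the actual kernel code: only struct rseq_cs mirrors the UAPI layout from
include/uapi/linux/rseq.h; the helper names and surrounding logic are
illustrative placeholders.

/*
 * Illustrative sketch only - not kernel code. struct rseq_cs matches the
 * UAPI layout in include/uapi/linux/rseq.h; everything else is made up
 * to show the syscall vs. interrupt/exception distinction.
 */
#include <stdbool.h>
#include <stdint.h>

struct rseq_cs {
	uint32_t version;
	uint32_t flags;
	uint64_t start_ip;		/* First instruction of the section */
	uint64_t post_commit_offset;	/* Length of the section in bytes */
	uint64_t abort_ip;		/* Where to resume after an abort */
};

/* Is the user instruction pointer inside the registered section? */
static bool ip_in_critical_section(const struct rseq_cs *cs, uint64_t ip)
{
	return ip - cs->start_ip < cs->post_commit_offset;
}

/*
 * Invoked on exit to user space. @from_syscall tells whether this is a
 * return from a syscall or from an interrupt/exception.
 */
static uint64_t rseq_resume_ip(const struct rseq_cs *cs, uint64_t ip,
			       bool from_syscall)
{
	if (!cs || !ip_in_critical_section(cs, ip))
		return ip;		/* Nothing to fix up */

	/*
	 * A critical section must not invoke a syscall. Fixing it up
	 * silently papers over a user space bug, which is what the
	 * current code does on production kernels.
	 */
	if (from_syscall)
		return ip;

	/* Interrupt/exception hit the section: resume at the abort handler */
	return cs->abort_ip;
}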
This series addresses these issues and on top converts parts of the user
space access over to the new masked access model, which lowers the overhead
of the Spectre-V1 mitigations significantly on architectures which support it
(x86 as of today). This is especially noticeable in the access to the
rseq_cs field in struct rseq, which is the first quick check to figure out
whether a critical section is installed or not.
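
For illustration, below is a conceptual sketch of that first check under the
masked access model, assuming an x86_64-style split where valid user
addresses have the top bit clear. The helper and struct names are mine and
abbreviated for the example; they are not the kernel's actual API.

#include <stdbool.h>
#include <stdint.h>

/*
 * Branch-free clamping: an out-of-range pointer becomes an all-ones,
 * guaranteed faulting address, so no speculation barrier is needed
 * before dereferencing it.
 */
static inline const void *mask_user_pointer(const void *ptr, uintptr_t limit)
{
	uintptr_t addr = (uintptr_t)ptr;

	/* addr > limit ? ~0UL : addr */
	return (const void *)(addr | (uintptr_t)((intptr_t)(limit - addr) >> 63));
}

/* Abbreviated stand-in for the UAPI struct rseq */
struct rseq_abi {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;	/* Address of the installed critical section, or 0 */
	uint32_t flags;
};

static bool critical_section_installed(const struct rseq_abi *urseq, uintptr_t limit)
{
	const uint64_t *p = mask_user_pointer(&urseq->rseq_cs, limit);

	/* Real code would use a faulting user access helper here */
	return *p != 0;
}

If that load comes back zero, no critical section is installed and the exit
handling is done; that is the common fast path which benefits most from
avoiding the speculation barrier.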
It survives the kernel's rseq selftests, but I did not run any performance
tests vs. rseq because I have no idea how to use the gazillion of undocumented
command line parameters of the benchmark. I leave that to people who are so
familiar with them that they assume everyone else is too :)
The performance gain on regular workloads is clearly measurable, and the
consistent event flag state now allows building the time slice extension
mechanism on top. The first POC I implemented:
https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/
suffered badly from the stale event mask bits; the cleaned up version
brought a whopping 25+% performance gain.
The series depends on the separately posted rseq bugfix:
https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/
and the uaccess generic helper series:
https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/
and a related futex fix in
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent
For your convenience it is also available as a conglomerate from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
Thanks,
tglx
---
drivers/hv/mshv_common.c | 2
include/linux/entry-common.h | 12 +--
include/linux/irq-entry-common.h | 31 ++++++--
include/linux/resume_user_mode.h | 40 +++++++---
include/linux/rseq.h | 115 +++++++++----------------------
include/linux/sched.h | 10 +-
include/uapi/linux/rseq.h | 21 +----
io_uring/io_uring.h | 2
kernel/entry/common.c | 9 +-
kernel/entry/kvm.c | 2
kernel/rseq.c | 142 +++++++++++++++++++++++++--------------
kernel/sched/core.c | 8 +-
kernel/sched/membarrier.c | 8 +-
13 files changed, 213 insertions(+), 189 deletions(-)