Message-ID: <20250813155941.014821755@linutronix.de>
Date: Wed, 13 Aug 2025 18:29:12 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: LKML <linux-kernel@...r.kernel.org>
Cc: Michael Jeanson <mjeanson@...icios.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Peter Zijlstra <peterz@...radead.org>,
"Paul E. McKenney" <paulmck@...nel.org>,
Boqun Feng <boqun.feng@...il.com>,
Wei Liu <wei.liu@...nel.org>,
Jens Axboe <axboe@...nel.dk>
Subject: [patch 00/11] rseq: Optimize exit to user space
With the widespread usage of rseq in glibc, rseq is no longer a
niche use case for special applications.
While working on a sane implementation of an rseq-based time slice extension
mechanism, I noticed several shortcomings of the current rseq code:
1) task::rseq_event_mask is a pointless bitfield: the ABI flags it was
meant to support were deprecated and functionally disabled three
years ago.
2) task::rseq_event_mask accumulates bits unless a critical
section is discovered in the user space rseq memory. This results in
pointless invocations of the rseq user space exit handler even if
nothing has changed. As a matter of correctness these bits have
to be clear when exiting to user space and therefore pristine when
coming back into the kernel. Aside from correctness, this also avoids
pointless evaluation of the user space memory, which is a performance
benefit.
3) The evaluation of critical sections does not differentiate between
syscall and interrupt/exception exits. The current implementation
silently fixes up critical sections which invoked a syscall unless
CONFIG_DEBUG_RSEQ is enabled.

That's just wrong. If user space does that on a production kernel it
can keep the pieces. The kernel is not there to proliferate mindless
user space programming and to let everyone pay the performance
penalty. The sketch below illustrates the distinction.
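
To make this concrete, here is a rough C sketch of that distinction. It is
not the actual kernel code: only struct rseq_cs mirrors the UAPI layout from
include/uapi/linux/rseq.h; the helper names and surrounding logic are
illustrative placeholders.

/*
 * Illustrative sketch only - not kernel code. struct rseq_cs matches the
 * UAPI layout in include/uapi/linux/rseq.h; everything else is made up
 * to show the syscall vs. interrupt/exception distinction.
 */
#include <stdbool.h>
#include <stdint.h>

struct rseq_cs {
	uint32_t version;
	uint32_t flags;
	uint64_t start_ip;		/* First instruction of the section */
	uint64_t post_commit_offset;	/* Length of the section in bytes */
	uint64_t abort_ip;		/* Where to resume after an abort */
};

/* Is the user instruction pointer inside the registered section? */
static bool ip_in_critical_section(const struct rseq_cs *cs, uint64_t ip)
{
	return ip - cs->start_ip < cs->post_commit_offset;
}

/*
 * Invoked on exit to user space. @from_syscall tells whether this is a
 * return from a syscall or from an interrupt/exception.
 */
static uint64_t rseq_resume_ip(const struct rseq_cs *cs, uint64_t ip,
			       bool from_syscall)
{
	if (!cs || !ip_in_critical_section(cs, ip))
		return ip;		/* Nothing to fix up */

	/*
	 * A critical section must not invoke a syscall. Fixing it up
	 * silently papers over a user space bug, which is what the
	 * current code does on production kernels.
	 */
	if (from_syscall)
		return ip;

	/* Interrupt/exception hit the section: resume at the abort handler */
	return cs->abort_ip;
}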
This series addresses these issues and on top converts parts of the user
space access over to the new masked access model, which lowers the overhead
of the Spectre-V1 mitigations significantly on architectures which support it
(x86 as of today). This is especially noticeable in the access to the
rseq_cs field in struct rseq, which is the first quick check to figure out
whether a critical section is installed or not.
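
For illustration, below is a conceptual sketch of that first check under the
masked access model, assuming an x86_64-style split where valid user
addresses have the top bit clear. The helper and struct names are mine and
abbreviated for the example; they are not the kernel's actual API.

#include <stdbool.h>
#include <stdint.h>

/*
 * Branch-free clamping: an out-of-range pointer becomes an all-ones,
 * guaranteed faulting address, so no speculation barrier is needed
 * before dereferencing it.
 */
static inline const void *mask_user_pointer(const void *ptr, uintptr_t limit)
{
	uintptr_t addr = (uintptr_t)ptr;

	/* addr > limit ? ~0UL : addr */
	return (const void *)(addr | (uintptr_t)((intptr_t)(limit - addr) >> 63));
}

/* Abbreviated stand-in for the UAPI struct rseq */
struct rseq_abi {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;	/* Address of the installed critical section, or 0 */
	uint32_t flags;
};

static bool critical_section_installed(const struct rseq_abi *urseq, uintptr_t limit)
{
	const uint64_t *p = mask_user_pointer(&urseq->rseq_cs, limit);

	/* Real code would use a faulting user access helper here */
	return *p != 0;
}

If that load comes back zero, no critical section is installed and the exit
handling is done; that is the common fast path which benefits most from
avoiding the speculation barrier.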
It survives the kernel's rseq selftests, but I did not run any performance
tests vs. rseq because I have no idea how to use the gazillion of undocumented
command line parameters of the benchmark. I leave that to people who are so
familiar with them that they assume everyone else is too :)
The performance gain on regular workloads is clearly measurable, and the
consistent event flag state now allows building the time slice extension
mechanism on top. The first POC I implemented:
https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/
suffered badly from the stale event mask bits; the cleaned up version
brought a whopping 25+% performance gain.
The series depends on the separately posted rseq bugfix:
https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/
and the uaccess generic helper series:
https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/
and a related futex fix in
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent
For your convenience it is also available as a conglomerate from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
Thanks,
tglx
---
drivers/hv/mshv_common.c | 2
include/linux/entry-common.h | 12 +--
include/linux/irq-entry-common.h | 31 ++++++--
include/linux/resume_user_mode.h | 40 +++++++---
include/linux/rseq.h | 115 +++++++++----------------------
include/linux/sched.h | 10 +-
include/uapi/linux/rseq.h | 21 +----
io_uring/io_uring.h | 2
kernel/entry/common.c | 9 +-
kernel/entry/kvm.c | 2
kernel/rseq.c | 142 +++++++++++++++++++++++++--------------
kernel/sched/core.c | 8 +-
kernel/sched/membarrier.c | 8 +-
13 files changed, 213 insertions(+), 189 deletions(-)