Message-ID: <20250904185336.943880027@linutronix.de>
Date: Fri,  5 Sep 2025 00:20:33 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: LKML <linux-kernel@...r.kernel.org>
Cc: Michael Jeanson <mjeanson@...icios.com>,
 Jens Axboe <axboe@...nel.dk>,
 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
 Peter Zijlstra <peterz@...radead.org>,
 "Paul E. McKenney" <paulmck@...nel.org>,
 Boqun Feng <boqun.feng@...il.com>,
 Paolo Bonzini <pbonzini@...hat.com>,
 Sean Christopherson <seanjc@...gle.com>,
 Wei Liu <wei.liu@...nel.org>,
 Dexuan Cui <decui@...rosoft.com>,
 x86@...nel.org,
 Arnd Bergmann <arnd@...db.de>,
 Heiko Carstens <hca@...ux.ibm.com>,
 Christian Borntraeger <borntraeger@...ux.ibm.com>,
 Sven Schnelle <svens@...ux.ibm.com>,
 Huacai Chen <chenhuacai@...nel.org>,
 Paul Walmsley <paul.walmsley@...ive.com>,
 Palmer Dabbelt <palmer@...belt.com>
Subject: [patch V3 00/37] rseq: Optimize exit to user space

This is a follow-up to the V2 series, which can be found here:

   https://lore.kernel.org/all/20250823161326.635281786@linutronix.de

The V2 posting contains a detailed list of the addressed problems. TLDR:

    - A significant number of pointless RSEQ operations on exit to user
      space, which people have reported as having a measurable impact
      since glibc switched to using RSEQ.

    - Suboptimal hotpath handling both in the scheduler and on exit to user
      space.

This series addresses these issues by:

  1) Limiting the RSEQ work to the actual conditions where it is
     required. The full benefit is only available for architectures using
     the generic entry infrastructure. All others get at least the basic
     improvements.

  2) Re-implementing the whole user space handling based on proper data
     structures and by actually looking at the impact it creates in the
     fast path.

  3) Moving the actual handling of RSEQ out to the latest point in the exit
     path, where possible. This is fully inlined into the fast path to keep
     the impact confined.
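
To illustrate points 1) and 3), here is a minimal user space sketch of
the general idea: the scheduler side only records events, and the exit
side does the work as late as possible and only when an event is
pending. All names (demo_task, demo_note_sched_switch(),
demo_exit_to_user()) are hypothetical stand-ins and not the kernel's
data structures or functions. The sketch also leaves out critical
section handling entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for illustration, not the kernel's data structures. */
struct demo_task {
	bool sched_switch;	  /* task was scheduled since the last exit */
	bool ids_changed;	  /* IDs visible to user space are stale */
	uint32_t cpu_id;	  /* kernel-side view */
	uint32_t user_cpu_id;	  /* what user space would read */
	unsigned int user_writes; /* number of simulated user space updates */
};

/* Scheduler side: only record what happened, never touch user memory. */
static void demo_note_sched_switch(struct demo_task *t, uint32_t new_cpu)
{
	assert(t != NULL);
	t->sched_switch = true;
	if (new_cpu != t->cpu_id) {
		t->cpu_id = new_cpu;
		t->ids_changed = true;
	}
}

/*
 * Exit side: do the work as late as possible and only if an event is
 * pending. Returns true if the simulated user space copy was written.
 */
static bool demo_exit_to_user(struct demo_task *t)
{
	bool wrote = false;

	assert(t != NULL);
	if (!t->sched_switch)
		return false;	/* fast path: nothing to do */

	if (t->ids_changed) {
		t->user_cpu_id = t->cpu_id;
		t->user_writes++;
		wrote = true;
	}
	t->sched_switch = false;
	t->ids_changed = false;
	return wrote;
}
```

In this model an exit without a preceding schedule event costs a single
flag test, and a schedule event that leaves the IDs unchanged causes no
write to the user space copy.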

Changes vs. V2:

  - Bring back the ROP protection - Mathieu

  - Document the guest-visible change when host TLS is mapped into the guest - Sean

  - Document the TIF_RSEQ optimization for virt - Sean

  - Fix the __setup() return value - Michael

  - Add the missing include in HV - 0-day

  - Rename *uids to *ids - Mathieu

  - Spelling and grammar fixes in comments and change logs - Mathieu

  - Picked up tags where appropriate
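
For readers unfamiliar with the ROP protection mentioned in the first
item, here is a rough user space model of the check. It assumes "user
memory" is a plain byte array, addresses are offsets into it, and
mem_size plays the role of TASK_SIZE. demo_abort_ip_valid() is a
hypothetical helper for illustration only. The delta patch below does
the equivalent with unsafe_get_user() inside a user access section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of the abort IP validation: the address must lie
 * inside the address space and the four bytes immediately preceding it
 * must contain the registered signature.
 */
static bool demo_abort_ip_valid(const uint8_t *mem, size_t mem_size,
				uint64_t abort_ip, uint32_t expected_sig)
{
	uint32_t sig;

	assert(mem != NULL);

	/* Reject addresses outside the range or too low to hold a signature */
	if (abort_ip >= mem_size || abort_ip < sizeof(sig))
		return false;

	/* Read the four bytes in front of abort_ip and compare them */
	memcpy(&sig, mem + (abort_ip - sizeof(sig)), sizeof(sig));
	return sig == expected_sig;
}
```

An abort address that points at an arbitrary function or gadget fails
this check unless the signature happens to sit directly in front of it,
which is what makes a forged critical section descriptor much less
useful to an attacker.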

Delta patch to V2 is below.

As with the previous version, these patches have a pile of dependencies:

The series depends on the separately posted rseq bugfix:

   https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/

and the uaccess generic helper series:

   https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/

and a related futex fix in

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent

The combination of all of them and some other related fixes (rseq
selftests) is available here:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/base

For your convenience all of it is also available as a conglomerate from
git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf

Thanks,

	tglx
---
 Documentation/admin-guide/kernel-parameters.txt |    4 
 arch/Kconfig                                    |    4 
 arch/loongarch/Kconfig                          |    1 
 arch/loongarch/include/asm/thread_info.h        |   76 +-
 arch/riscv/Kconfig                              |    1 
 arch/riscv/include/asm/thread_info.h            |   31 -
 arch/s390/Kconfig                               |    1 
 arch/s390/include/asm/thread_info.h             |   44 -
 arch/x86/Kconfig                                |    1 
 arch/x86/entry/syscall_32.c                     |    3 
 arch/x86/include/asm/thread_info.h              |   76 +-
 drivers/hv/mshv_root_main.c                     |    3 
 fs/binfmt_elf.c                                 |    2 
 fs/exec.c                                       |    2 
 include/asm-generic/thread_info_tif.h           |   51 +
 include/linux/entry-common.h                    |   38 -
 include/linux/irq-entry-common.h                |   68 ++
 include/linux/mm.h                              |   25 
 include/linux/resume_user_mode.h                |    2 
 include/linux/rseq.h                            |  223 +++++---
 include/linux/rseq_entry.h                      |  621 ++++++++++++++++++++++++
 include/linux/rseq_types.h                      |   72 ++
 include/linux/sched.h                           |   50 +
 include/linux/thread_info.h                     |    5 
 include/trace/events/rseq.h                     |    4 
 include/uapi/linux/rseq.h                       |   21 
 init/Kconfig                                    |   28 +
 kernel/entry/common.c                           |   37 -
 kernel/entry/syscall-common.c                   |    8 
 kernel/rseq.c                                   |  604 +++++++++--------------
 kernel/sched/core.c                             |   10 
 kernel/sched/membarrier.c                       |    8 
 kernel/sched/sched.h                            |    5 
 virt/kvm/kvm_main.c                             |    3 
 34 files changed, 1433 insertions(+), 699 deletions(-)
---
Delta to V2:

--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -28,6 +28,7 @@
 #include <linux/crash_dump.h>
 #include <linux/panic_notifier.h>
 #include <linux/vmalloc.h>
+#include <linux/rseq.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -241,7 +241,7 @@ static __always_inline void __exit_to_us
  * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs:	Pointer to pt_regs on entry stack
  *
- * Wrapper around __exit_to_user_mode_prepare() to seperate the exit work for
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
  * syscalls and interrupts.
  */
 static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
@@ -255,7 +255,7 @@ static __always_inline void syscall_exit
  * irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs:	Pointer to pt_regs on entry stack
  *
- * Wrapper around __exit_to_user_mode_prepare() to seperate the exit work for
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
  * syscalls and interrupts.
  */
 static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -112,17 +112,24 @@ static inline void rseq_force_update(voi
 
 /*
  * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
- * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
- * that case just to do it eventually again before returning to user space,
- * the entry resume_user_mode_work() invocation is ignored as the register
- * argument is NULL.
+ * which clears TIF_NOTIFY_RESUME on architectures that don't use the
+ * generic TIF bits and therefore can't provide a separate TIF_RSEQ flag.
  *
- * After returning from guest mode, they have to invoke this function to
- * re-raise TIF_NOTIFY_RESUME if necessary.
+ * That invocation does not update user space RSEQ, which would have to be
+ * redone before returning to user space anyway, because
+ * __rseq_handle_slowpath() does nothing when invoked with NULL register state.
+ *
+ * After returning from guest mode, before exiting to userspace, hypervisors
+ * must invoke this function to re-raise TIF_NOTIFY_RESUME if necessary.
  */
 static inline void rseq_virt_userspace_exit(void)
 {
-	if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) && current->rseq_event.sched_switch)
+	/*
+	 * The generic optimization for deferring RSEQ updates until the next
+	 * exit relies on having a dedicated TIF_RSEQ.
+	 */
+	if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) &&
+	    current->rseq_event.sched_switch)
 		rseq_raise_notify_resume(current);
 }
 
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -53,10 +53,8 @@ void __rseq_trace_ip_fixup(unsigned long
 
 static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids)
 {
-	if (tracepoint_enabled(rseq_update)) {
-		if (ids)
-			__rseq_trace_update(t);
-	}
+	if (tracepoint_enabled(rseq_update) && ids)
+		__rseq_trace_update(t);
 }
 
 static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
@@ -81,7 +79,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
 #endif
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
-bool rseq_debug_validate_uids(struct task_struct *t);
+bool rseq_debug_validate_ids(struct task_struct *t);
 
 static __always_inline void rseq_note_user_irq_entry(void)
 {
@@ -209,14 +207,20 @@ bool rseq_debug_update_user_cs(struct ta
  * debugging is enabled, but don't do that on the first exit to user
  * space. In that case cpu_cid is ~0. See fork/execve.
  */
-bool rseq_debug_validate_uids(struct task_struct *t)
+bool rseq_debug_validate_ids(struct task_struct *t)
 {
-	u32 cpu_id, uval, node_id = cpu_to_node(task_cpu(t));
 	struct rseq __user *rseq = t->rseq;
+	u32 cpu_id, uval, node_id;
 
 	if (t->rseq_ids.cpu_cid == ~0)
 		return true;
 
+	/*
+	 * Look it up outside of the user access section as cpu_to_node()
+	 * can end up in debug code.
+	 */
+	node_id = cpu_to_node(t->rseq_ids.cpu_id);
+
 	if (!user_read_masked_begin(rseq))
 		return false;
 
@@ -252,11 +256,13 @@ rseq_update_user_cs(struct task_struct *
 {
 	struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
 	unsigned long ip = instruction_pointer(regs);
+	unsigned long tasksize = TASK_SIZE;
 	u64 start_ip, abort_ip, offset;
+	u32 usig, __user *uc_sig;
 
 	rseq_stat_inc(rseq_stats.cs);
 
-	if (unlikely(csaddr >= TASK_SIZE)) {
+	if (unlikely(csaddr >= tasksize)) {
 		t->rseq_event.fatal = true;
 		return false;
 	}
@@ -281,15 +287,28 @@ rseq_update_user_cs(struct task_struct *
 		goto clear;
 
 	/*
-	 * Force it to be in user space as x86 IRET would happily return to
-	 * the kernel. Can't use TASK_SIZE as a mask because that's not
-	 * necessarily a power of two. Just make sure it's in the user
-	 * address space. Let the pagefault handler sort it out.
+	 * Two requirements for @abort_ip:
+	 *   - Must be in user space as x86 IRET would happily return to
+	 *     the kernel.
+	 *   - The four bytes preceding the instruction at @abort_ip must
+	 *     contain the signature.
+	 *
+	 * The latter protects against the following attack vector:
 	 *
-	 * Use LONG_MAX and not LLONG_MAX to keep it correct for 32 and 64
-	 * bit architectures.
+	 * An attacker with limited abilities to write, creates a critical
+	 * section descriptor, sets the abort IP to a library function or
+	 * some other ROP gadget and stores the address of the descriptor
+	 * in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
+	 * protection.
 	 */
-	abort_ip &= (u64)LONG_MAX;
+	if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
+		goto die;
+
+	/* The address is guaranteed to be >= 0 and < TASK_SIZE */
+	uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+	unsafe_get_user(usig, uc_sig, fail);
+	if (unlikely(usig != t->rseq_sig))
+		goto die;
 
 	/* Invalidate the critical section */
 	unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail);
@@ -306,7 +325,8 @@ rseq_update_user_cs(struct task_struct *
 	user_access_end();
 	rseq_stat_inc(rseq_stats.clear);
 	return true;
-
+die:
+	t->rseq_event.fatal = true;
 fail:
 	user_access_end();
 	return false;
@@ -335,13 +355,13 @@ rseq_update_user_cs(struct task_struct *
  * faults in task context are fatal too.
  */
 static rseq_inline
-bool rseq_set_uids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
-			      u32 node_id, u64 *csaddr)
+bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
+			     u32 node_id, u64 *csaddr)
 {
 	struct rseq __user *rseq = t->rseq;
 
 	if (static_branch_unlikely(&rseq_debug_enabled)) {
-		if (!rseq_debug_validate_uids(t))
+		if (!rseq_debug_validate_ids(t))
 			return false;
 	}
 
@@ -375,7 +395,7 @@ static rseq_inline bool rseq_update_usr(
 {
 	u64 csaddr;
 
-	if (!rseq_set_uids_get_csaddr(t, ids, node_id, &csaddr))
+	if (!rseq_set_ids_get_csaddr(t, ids, node_id, &csaddr))
 		return false;
 
 	/*
@@ -507,6 +527,7 @@ static __always_inline bool __rseq_exit_
 # define CHECK_TIF_RSEQ		_TIF_RSEQ
 static __always_inline void clear_tif_rseq(void)
 {
+	static_assert(TIF_RSEQ != TIF_NOTIFY_RESUME);
 	clear_thread_flag(TIF_RSEQ);
 }
 #else
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -3,13 +3,13 @@
 #define _LINUX_RSEQ_TYPES_H
 
 #include <linux/types.h>
-/* Forward declaration for the sched.h */
+/* Forward declaration for sched.h */
 struct rseq;
 
 /*
  * struct rseq_event - Storage for rseq related event management
  * @all:		Compound to initialize and clear the data efficiently
- * @events:		Compund to access events with a single load/store
+ * @events:		Compound to access events with a single load/store
  * @sched_switch:	True if the task was scheduled and needs update on
  *			exit to user
  * @ids_changed:	Indicator that IDs need to be updated
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -99,7 +99,7 @@ static int __init rseq_setup_debug(char
 	if (kstrtobool(str, &on))
 		return -EINVAL;
 	rseq_control_debug(on);
-	return 0;
+	return 1;
 }
 __setup("rseq_debug=", rseq_setup_debug);
 
@@ -218,9 +218,9 @@ static int __init rseq_debugfs_init(void
 __initcall(rseq_debugfs_init);
 #endif /* CONFIG_DEBUG_FS */
 
-static bool rseq_set_uids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
+static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
 {
-	return rseq_set_uids_get_csaddr(t, ids, node_id, NULL);
+	return rseq_set_ids_get_csaddr(t, ids, node_id, NULL);
 }
 
 static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
@@ -374,7 +374,7 @@ static bool rseq_reset_ids(void)
 	 * stupid state as exit to user space will try to fixup the ids
 	 * again.
 	 */
-	if (rseq_set_uids(current, &ids, 0))
+	if (rseq_set_ids(current, &ids, 0))
 		return true;
 
 	force_sig(SIGSEGV);


