Message-ID: <20250823161655.651830871@linutronix.de>
Date: Sat, 23 Aug 2025 18:40:42 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: LKML <linux-kernel@...r.kernel.org>
Cc: Jens Axboe <axboe@...nel.dk>,
 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
 Peter Zijlstra <peterz@...radead.org>,
 "Paul E. McKenney" <paulmck@...nel.org>,
 Boqun Feng <boqun.feng@...il.com>,
 Paolo Bonzini <pbonzini@...hat.com>,
 Sean Christopherson <seanjc@...gle.com>,
 Wei Liu <wei.liu@...nel.org>,
 Dexuan Cui <decui@...rosoft.com>,
 x86@...nel.org,
 Arnd Bergmann <arnd@...db.de>,
 Heiko Carstens <hca@...ux.ibm.com>,
 Christian Borntraeger <borntraeger@...ux.ibm.com>,
 Sven Schnelle <svens@...ux.ibm.com>,
 Huacai Chen <chenhuacai@...nel.org>,
 Paul Walmsley <paul.walmsley@...ive.com>,
 Palmer Dabbelt <palmer@...belt.com>
Subject: [patch V2 37/37] entry/rseq: Optimize for TIF_RSEQ on exit

Further analysis of the exit path with the separate TIF_RSEQ showed that,
depending on the workload, a significant number of invocations of
resume_user_mode_work() end up with no bit other than TIF_RSEQ set.

On architectures with a separate TIF_RSEQ this case can be detected and
handled right at the beginning of the function, before entering the loop.

The quick check is lightweight, so it does not impose a significant penalty
on non-RSEQ use cases. It simply checks whether the pending work is empty
except for TIF_RSEQ and, if so, jumps straight into the handling fast path.
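
As an illustration, the check boils down to a single mask comparison. A
minimal user-space sketch, using purely hypothetical bit values standing in
for the architecture specific TIF_* masks (the real definitions differ per
architecture):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical placeholder values, not the real kernel definitions. */
#define TIF_RSEQ		(1UL << 5)
#define EXIT_TO_USER_MODE_WORK	0xffUL

/* True when TIF_RSEQ is the only pending exit-to-user work bit. */
static bool only_rseq_pending(unsigned long ti_work)
{
	return (ti_work & EXIT_TO_USER_MODE_WORK) == TIF_RSEQ;
}

int main(void)
{
	printf("%d\n", only_rseq_pending(TIF_RSEQ));		/* 1: take the shortcut */
	printf("%d\n", only_rseq_pending(TIF_RSEQ | 0x2));	/* 0: run the full work loop */
	return 0;
}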

This is the only TIF bit which can be optimized that way, because its
handling runs only after all other work has been done. The optimization
spares a full round trip through the other conditionals and an interrupt
enable/disable pair. The generated code looks reasonable enough to justify
this, and the resulting numbers do so as well.

The main beneficiaries are workloads heavy on blocking syscalls, where the
tasks often end up being scheduled on a different CPU or getting a different
MM CID, but have no other work to handle on return.

A futex benchmark showed up to 90% shortcut utilization and a measurable
performance improvement of ~1%. Non-scheduling workloads neither see an
improvement nor degrade. A full kernel build shows about 15% shortcut
utilization, but no measurable effect in either direction.

Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
---
 include/linux/rseq_entry.h |   14 ++++++++++++++
 kernel/entry/common.c      |   13 +++++++++++--
 kernel/rseq.c              |    2 ++
 3 files changed, 27 insertions(+), 2 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -11,6 +11,7 @@ struct rseq_stats {
 	unsigned long	signal;
 	unsigned long	slowpath;
 	unsigned long	fastpath;
+	unsigned long	quicktif;
 	unsigned long	ids;
 	unsigned long	cs;
 	unsigned long	clear;
@@ -532,6 +533,14 @@ rseq_exit_to_user_mode_work(struct pt_re
 	return ti_work | _TIF_NOTIFY_RESUME;
 }
 
+static __always_inline bool
+rseq_exit_to_user_mode_early(unsigned long ti_work, const unsigned long mask)
+{
+	if (IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS))
+		return (ti_work & mask) == CHECK_TIF_RSEQ;
+	return false;
+}
+
 #endif /* !CONFIG_GENERIC_ENTRY */
 
 static __always_inline void rseq_syscall_exit_to_user_mode(void)
@@ -577,6 +586,11 @@ static inline unsigned long rseq_exit_to
 {
 	return ti_work;
 }
+
+static inline bool rseq_exit_to_user_mode_early(unsigned long ti_work, const unsigned long mask)
+{
+	return false;
+}
 static inline void rseq_note_user_irq_entry(void) { }
 static inline void rseq_syscall_exit_to_user_mode(void) { }
 static inline void rseq_irqentry_exit_to_user_mode(void) { }
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -22,7 +22,14 @@ void __weak arch_do_signal_or_restart(st
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
+	 *
+	 * Optimize for TIF_RSEQ being the only bit set.
 	 */
+	if (rseq_exit_to_user_mode_early(ti_work, EXIT_TO_USER_MODE_WORK)) {
+		rseq_stat_inc(rseq_stats.quicktif);
+		goto do_rseq;
+	}
+
 	do {
 		local_irq_enable_exit_to_user(ti_work);
 
@@ -56,10 +63,12 @@ void __weak arch_do_signal_or_restart(st
 
 		ti_work = read_thread_flags();
 
+	do_rseq:
 		/*
 		 * This returns the unmodified ti_work, when ti_work is not
-		 * empty. In that case it waits for the next round to avoid
-		 * multiple updates in case of rescheduling.
+		 * empty (except for TIF_RSEQ). In that case it waits for
+		 * the next round to avoid multiple updates in case of
+		 * rescheduling.
 		 *
 		 * When it handles rseq it returns either with empty work
 		 * on success or with TIF_NOTIFY_RESUME set on failure to
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -134,6 +134,7 @@ static int rseq_stats_show(struct seq_fi
 		stats.signal	+= data_race(per_cpu(rseq_stats.signal, cpu));
 		stats.slowpath	+= data_race(per_cpu(rseq_stats.slowpath, cpu));
 		stats.fastpath	+= data_race(per_cpu(rseq_stats.fastpath, cpu));
+		stats.quicktif	+= data_race(per_cpu(rseq_stats.quicktif, cpu));
 		stats.ids	+= data_race(per_cpu(rseq_stats.ids, cpu));
 		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
 		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
@@ -144,6 +145,7 @@ static int rseq_stats_show(struct seq_fi
 	seq_printf(m, "signal: %16lu\n", stats.signal);
 	seq_printf(m, "slowp:  %16lu\n", stats.slowpath);
 	seq_printf(m, "fastp:  %16lu\n", stats.fastpath);
+	seq_printf(m, "quickt: %16lu\n", stats.quicktif);
 	seq_printf(m, "ids:    %16lu\n", stats.ids);
 	seq_printf(m, "cs:     %16lu\n", stats.cs);
 	seq_printf(m, "clear:  %16lu\n", stats.clear);

