[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8d2380be-20c1-4c0d-8067-efedcce6c9d5@efficios.com>
Date: Tue, 26 Aug 2025 11:33:37 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Jens Axboe <axboe@...nel.dk>, Peter Zijlstra <peterz@...radead.org>,
"Paul E. McKenney" <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
Paolo Bonzini <pbonzini@...hat.com>, Sean Christopherson
<seanjc@...gle.com>, Wei Liu <wei.liu@...nel.org>,
Dexuan Cui <decui@...rosoft.com>, x86@...nel.org,
Arnd Bergmann <arnd@...db.de>, Heiko Carstens <hca@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>, Huacai Chen <chenhuacai@...nel.org>,
Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>
Subject: Re: [patch V2 27/37] rseq: Implement fast path for exit to user
On 2025-08-23 12:40, Thomas Gleixner wrote:
> Implement the actual logic for handling RSEQ updates in a fast path after
> handling the TIF work and at the point where the task is actually returning
> to user space.
>
> This is the right point to do that because at this point the CPU and the MM
> CID are stable and cannot longer change due to yet another reschedule.
> That happens when the task is handling it via TIF_NOTIFY_RESUME in
> resume_user_mode_work(), which is invoked from the exit to user mode work
> loop.
>
> The function is invoked after the TIF work is handled and runs with
> interrupts disabled, which means it cannot resolve page faults. It
> therefore disables page faults and in case the access to the user space
> memory faults, it:
>
> - notes the fail in the event struct
> - raises TIF_NOTIFY_RESUME
> - returns false to the caller
>
> The caller has to go back to the TIF work, which runs with interrupts
> enabled and therefore can resolve the page faults. This happens mostly on
> fork() when the memory is marked COW. That will be optimized by setting the
> failure flag and raising TIF_NOTIFY_RESUME right on fork to avoid the
> otherwise unavoidable round trip.
>
> If the user memory inspection finds invalid data, the function returns
> false as well and sets the fatal flag in the event struct along with
> TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
> and terminate the task with SIGSEGV as documented.
>
> The initial decision to invoke any of this is based on two flags in the
> event struct: @has_rseq and @sched_switch. The decision is in pseudo ASM:
>
> load tsk::event::has_rseq
> and tsk::event::sched_switch
> jnz inspect_user_space
> mov $0, tsk::event::events
> ...
> leave
>
> So for the common case where the task was not scheduled out, this really
> boils down to four instructions before going out if the compiler is not
> completely stupid (and yes, some of them are).
>
> If the condition is true, then it checks, whether CPU ID or MM CID have
> changed. If so, then the CPU/MM IDs have to be updated and are thereby
> cached for the next round. The update unconditionally retrieves the user
> space critical section address to spare another user*begin/end() pair. If
> that's not zero and tsk::event::user_irq is set, then the critical section
> is analyzed and acted upon. If either zero or the entry came via syscall
> the critical section analysis is skipped.
>
> If the comparison is false then the critical section has to be analyzed
> because the event flag is then only true when entry from user was by
> interrupt.
>
> This is provided without the actual hookup to let reviewers focus on the
> implementation details. The hookup happens in the next step.
>
> Note: As with quite some other optimizations this depends on the generic
> entry infrastructure and is not enabled to be sucked into random
> architecture implementations.
>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> ---
> include/linux/rseq_entry.h | 137 ++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/rseq_types.h | 3
> kernel/rseq.c | 2
> 3 files changed, 139 insertions(+), 3 deletions(-)
>
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -10,6 +10,7 @@ struct rseq_stats {
> unsigned long exit;
> unsigned long signal;
> unsigned long slowpath;
> + unsigned long fastpath;
> unsigned long ids;
> unsigned long cs;
> unsigned long clear;
> @@ -204,8 +205,8 @@ bool rseq_debug_update_user_cs(struct ta
>
> /*
> * On debug kernels validate that user space did not mess with it if
> - * DEBUG_RSEQ is enabled, but don't on the first exit to user space. In
> - * that case cpu_cid is ~0. See fork/execve.
> + * debugging is enabled, but don't do that on the first exit to user
> + * space. In that case cpu_cid is ~0. See fork/execve.
> */
> bool rseq_debug_validate_uids(struct task_struct *t)
> {
> @@ -393,6 +394,131 @@ static rseq_inline bool rseq_update_usr(
> return rseq_update_user_cs(t, regs, csaddr);
> }
>
> +/*
> + * If you want to use this then convert your architecture to the generic
> + * entry code. I'm tired of building workarounds for people who can't be
> + * bothered to make the maintainence of generic infrastructure less
> + * burdensome. Just sucking everything into the architecture code and
> + * thereby making others chase the horrible hacks and keep them working is
> + * neither acceptable nor sustainable.
> + */
> +#ifdef CONFIG_GENERIC_ENTRY
> +
> +/*
> + * This is inlined into the exit path because:
> + *
> + * 1) It's a one time comparison in the fast path when there is no event to
> + * handle
> + *
> + * 2) The access to the user space rseq memory (TLS) is unlikely to fault
> + * so the straight inline operation is:
> + *
> + * - Four 32-bit stores only if CPU ID/ MM CID need to be updated
> + * - One 64-bit load to retrieve the critical section address
> + *
> + * 3) In the unlikely case that the critical section address is != NULL:
> + *
> + * - One 64-bit load to retrieve the start IP
> + * - One 64-bit load to retrieve the offset for calculating the end
> + * - One 64-bit load to retrieve the abort IP
> + * - One store to clear the critical section address
> + *
> + * The non-debug case implements only the minimal required checking and
> + * protection against a rogue abort IP in kernel space, which would be
> + * exploitable at least on x86. Any fallout from invalid critical section
> + * descriptors is a user space problem. The debug case provides the full
> + * set of checks and terminates the task if a condition is not met.
> + *
> + * In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
> + * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
> + * slow path there will handle the fail.
> + */
> +static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
> +{
> + struct task_struct *t = current;
> +
> + /*
> + * If the task did not go through schedule or got the flag enforced
> + * by the rseq syscall or execve, then nothing to do here.
> + *
> + * CPU ID and MM CID can only change when going through a context
> + * switch.
> + *
> + * This can only be done when rseq_event::has_rseq is true.
> + * rseq_sched_switch_event() sets rseq_event::sched unconditionally
> + * true to avoid a load of rseq_event::has_rseq in the context
> + * switch path.
> + *
> + * This check uses a '&' and not a '&&' to force the compiler to do
> + * an actual AND operation instead of two seperate conditionals.
> + *
> + * A sane compiler requires four instructions for the nothing to do
> + * case including clearing the events, but your milage might vary.
See my earlier comments about:
- Handling of dynamic numa node id to cpu mapping reconfiguration on
powerpc.
- Validation of the abort handler signature on production kernels.
Thanks,
Mathieu
> + */
> + if (likely(!(t->rseq_event.sched_switch & t->rseq_event.has_rseq)))
> + goto done;
> +
> + rseq_stat_inc(rseq_stats.fastpath);
> +
> + pagefault_disable();
> +
> + if (likely(!t->rseq_event.ids_changed)) {
> + /*
> + * If IDs have not changed rseq_event::user_irq must be true
> + * See rseq_sched_switch_event().
> + */
> + u64 csaddr;
> +
> + if (unlikely(get_user_masked_u64(&csaddr, &t->rseq->rseq_cs)))
> + goto fail;
> +
> + if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
> + if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
> + goto fail;
> + }
> + } else {
> + struct rseq_ids ids = {
> + .cpu_id = task_cpu(t),
> + .mm_cid = task_mm_cid(t),
> + };
> + u32 node_id = cpu_to_node(ids.cpu_id);
> +
> + if (unlikely(!rseq_update_usr(t, regs, &ids, node_id)))
> + goto fail;
> + }
> +
> + pagefault_enable();
> +
> +done:
> + /* Clear state so next entry starts from a clean slate */
> + t->rseq_event.events = 0;
> + return false;
> +
> +fail:
> + pagefault_enable();
> + /* Force it into the slow path. Don't clear the state! */
> + t->rseq_event.slowpath = true;
> + set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> + return true;
> +}
> +
> +static __always_inline unsigned long
> +rseq_exit_to_user_mode_work(struct pt_regs *regs, unsigned long ti_work, const unsigned long mask)
> +{
> + /*
> + * Check if all work bits have been cleared before handling rseq.
> + */
> + if ((ti_work & mask) != 0)
> + return ti_work;
> +
> + if (likely(!__rseq_exit_to_user_mode_restart(regs)))
> + return ti_work;
> +
> + return ti_work | _TIF_NOTIFY_RESUME;
> +}
> +
> +#endif /* !CONFIG_GENERIC_ENTRY */
> +
> static __always_inline void rseq_exit_to_user_mode(void)
> {
> struct rseq_event *ev = ¤t->rseq_event;
> @@ -417,8 +543,13 @@ static inline void rseq_debug_syscall_re
> if (static_branch_unlikely(&rseq_debug_enabled))
> __rseq_debug_syscall_return(regs);
> }
> -
> #else /* CONFIG_RSEQ */
> +static inline unsigned long rseq_exit_to_user_mode_work(struct pt_regs *regs,
> + unsigned long ti_work,
> + const unsigned long mask)
> +{
> + return ti_work;
> +}
> static inline void rseq_note_user_irq_entry(void) { }
> static inline void rseq_exit_to_user_mode(void) { }
> static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -17,6 +17,8 @@ struct rseq;
> * @has_rseq: True if the task has a rseq pointer installed
> * @error: Compound error code for the slow path to analyze
> * @fatal: User space data corrupted or invalid
> + * @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME
> + * is required
> *
> * @sched_switch and @ids_changed must be adjacent and the combo must be
> * 16bit aligned to allow a single store, when both are set at the same
> @@ -41,6 +43,7 @@ struct rseq_event {
> u16 error;
> struct {
> u8 fatal;
> + u8 slowpath;
> };
> };
> };
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -133,6 +133,7 @@ static int rseq_stats_show(struct seq_fi
> stats.exit += data_race(per_cpu(rseq_stats.exit, cpu));
> stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
> stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
> + stats.fastpath += data_race(per_cpu(rseq_stats.fastpath, cpu));
> stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
> stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
> stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
> @@ -142,6 +143,7 @@ static int rseq_stats_show(struct seq_fi
> seq_printf(m, "exit: %16lu\n", stats.exit);
> seq_printf(m, "signal: %16lu\n", stats.signal);
> seq_printf(m, "slowp: %16lu\n", stats.slowpath);
> + seq_printf(m, "fastp: %16lu\n", stats.fastpath);
> seq_printf(m, "ids: %16lu\n", stats.ids);
> seq_printf(m, "cs: %16lu\n", stats.cs);
> seq_printf(m, "clear: %16lu\n", stats.clear);
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Powered by blists - more mailing lists