Message-ID: <a5b70b86-ce87-435e-9ba5-407ea98e8c8b@efficios.com>
Date: Thu, 11 Sep 2025 09:40:57 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Michael Jeanson <mjeanson@...icios.com>, Jens Axboe <axboe@...nel.dk>,
Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
<paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
Paolo Bonzini <pbonzini@...hat.com>, Sean Christopherson
<seanjc@...gle.com>, Wei Liu <wei.liu@...nel.org>,
Dexuan Cui <decui@...rosoft.com>, x86@...nel.org,
Arnd Bergmann <arnd@...db.de>, Heiko Carstens <hca@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>, Huacai Chen <chenhuacai@...nel.org>,
Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>
Subject: Re: [patch V4 23/36] rseq: Provide and use rseq_set_ids()
On 2025-09-08 17:32, Thomas Gleixner wrote:
> Provide a new and straight forward implementation to set the IDs (CPU ID,
> Node ID and MM CID), which can be later inlined into the fast path.
>
> It does all operations in one user_rw_masked_begin() section and retrieves
> also the critical section member (rseq::cs_rseq) from user space to avoid
> another user..begin/end() pair. This is in preparation for optimizing the
> fast path to avoid extra work when not required.
>
> Use it to replace the whole related zoo in rseq.c
>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> ---
> V3: Fixed the node ID comparison in the debug path - Mathieu
> ---
> fs/binfmt_elf.c | 2
> include/linux/rseq_entry.h | 101 +++++++++++++++++++++
> include/linux/sched.h | 10 --
> kernel/rseq.c | 208 ++++++---------------------------------------
> 4 files changed, 134 insertions(+), 187 deletions(-)
>
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -46,7 +46,7 @@
> #include <linux/cred.h>
> #include <linux/dax.h>
> #include <linux/uaccess.h>
> -#include <linux/rseq.h>
> +#include <uapi/linux/rseq.h>
> #include <asm/param.h>
> #include <asm/page.h>
>
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -38,6 +38,8 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
> #include <linux/rseq.h>
> #include <linux/uaccess.h>
>
> +#include <uapi/linux/rseq.h>
> +
> #include <linux/tracepoint-defs.h>
>
> #ifdef CONFIG_TRACEPOINTS
> @@ -75,6 +77,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
> #endif
>
> bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
> +bool rseq_debug_validate_ids(struct task_struct *t);
>
> static __always_inline void rseq_note_user_irq_entry(void)
> {
> @@ -196,6 +199,50 @@ bool rseq_debug_update_user_cs(struct ta
> user_access_end();
> return false;
> }
> +
> +/*
> + * On debug kernels validate that user space did not mess with it if
> + * DEBUG_RSEQ is enabled, but don't on the first exit to user space. In
> + * that case cpu_cid is ~0. See fork/execve.
> + */
> +bool rseq_debug_validate_ids(struct task_struct *t)
> +{
> + struct rseq __user *rseq = t->rseq.usrptr;
Why not check for a NULL rseq user pointer rather than skip based on
the ~0ULL cpu_cid ?
> + u32 cpu_id, uval, node_id;
> +
> + if (t->rseq.ids.cpu_cid == ~0)
Here ~0 is used, while where cpu_cid is set in rseq_set() it is ~0ULL.
I understand that, because of sign extension and type promotion, both
happen to yield all bits set on 64-bit, but can we please just pick one
spelling, e.g. ~0ULL ?
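As a stand-alone illustration of the promotion point (my example, not
part of the patch):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t cpu_cid = ~0ULL;	/* as written by rseq_set()/rseq_fork() */

	/* ~0 is an int with value -1; the usual arithmetic conversions
	 * widen it to all bits set when compared against a u64. */
	assert(cpu_cid == ~0);		/* -1 converts to UINT64_MAX */
	assert(cpu_cid == ~0ULL);
	return 0;
}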
Also, why does rseq_fork() set it to ~0ULL rather than keep the value
from the parent when forking a new process ?
Whatever was present in the parent's rseq area should still be there in
the child, including the rseq registration.
Or am I missing something ?
> + return true;
> +
> + /*
> + * Look it up outside of the user access section as cpu_to_node()
> + * can end up in debug code.
> + */
> + node_id = cpu_to_node(t->rseq.ids.cpu_id);
> +
> + if (!user_read_masked_begin(rseq))
> + return false;
> +
> + unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
> + if (cpu_id != t->rseq.ids.cpu_id)
> + goto die;
> + unsafe_get_user(uval, &rseq->cpu_id, efault);
> + if (uval != cpu_id)
> + goto die;
> + unsafe_get_user(uval, &rseq->node_id, efault);
> + if (uval != node_id)
> + goto die;
> + unsafe_get_user(uval, &rseq->mm_cid, efault);
> + if (uval != t->rseq.ids.mm_cid)
> + goto die;
> + user_access_end();
> + return true;
> +die:
> + t->rseq.event.fatal = true;
> +efault:
> + user_access_end();
> + return false;
> +}
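For readers following along, here is a minimal user space sketch of the
fields this debug check validates (my example, not part of the patch;
it assumes the common 0x53053053 signature and that glibc has not
already registered its own area, e.g. run with
GLIBC_TUNABLES=glibc.pthread.rseq=0):

#define _GNU_SOURCE
#include <linux/rseq.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static __thread struct rseq rs __attribute__((aligned(32)));

int main(void)
{
	/* 32 == the original struct rseq size including padding */
	if (syscall(__NR_rseq, &rs, 32, 0, 0x53053053) != 0) {
		perror("rseq register");
		return 1;
	}
	/* cpu_id_start/cpu_id are kernel-owned, read-only for us */
	printf("running on cpu %u\n", rs.cpu_id);
	return 0;
}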
> +
> #endif /* RSEQ_BUILD_SLOW_PATH */
>
> /*
> @@ -281,6 +328,60 @@ rseq_update_user_cs(struct task_struct *
> user_access_end();
> return false;
> }
> +
> +/*
> + * Updates CPU ID, Node ID and MM CID and reads the critical section
> + * address, when @csaddr != NULL. This allows to put the ID update and the
> + * read under the same uaccess region to spare a seperate begin/end.
separate
> + *
> + * As this is either invoked from a C wrapper with @csaddr = NULL or from
> + * the fast path code with a valid pointer, a clever compiler should be
> + * able to optimize the read out. Spares a duplicate implementation.
> + *
> + * Returns true, if the operation was successful, false otherwise.
> + *
> + * In the failure case task::rseq_event::fatal is set when invalid data
> + * was found on debug kernels. It's clear when the failure was an unresolved page
> + * fault.
> + *
> + * If inlined into the exit to user path with interrupts disabled, the
> + * caller has to protect against page faults with pagefault_disable().
> + *
> + * In preemptible task context this would be counterproductive as the page
> + * faults could not be fully resolved. As a consequence unresolved page
> + * faults in task context are fatal too.
> + */
> +static rseq_inline
> +bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
> + u32 node_id, u64 *csaddr)
> +{
> + struct rseq __user *rseq = t->rseq.usrptr;
> +
> + if (static_branch_unlikely(&rseq_debug_enabled)) {
> + if (!rseq_debug_validate_ids(t))
> + return false;
> + }
> +
> + if (!user_rw_masked_begin(rseq))
> + return false;
> +
> + unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
> + unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
> + unsafe_put_user(node_id, &rseq->node_id, efault);
> + unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
> + if (csaddr)
> + unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
> + user_access_end();
> +
> + /* Cache the new values */
> + t->rseq.ids.cpu_cid = ids->cpu_cid;
> + rseq_stat_inc(rseq_stats.ids);
> + rseq_trace_update(t, ids);
> + return true;
> +efault:
> + user_access_end();
> + return false;
> +}
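On the "clever compiler" expectation above: a minimal stand-alone
sketch of the pattern (hypothetical names, not the kernel's) showing
why the NULL-csaddr wrapper costs nothing once inlined:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static inline bool set_ids_get_csaddr(uint64_t *csaddr)
{
	/* ...the four unsafe_put_user() calls would go here... */
	if (csaddr) {
		/* stands in for unsafe_get_user(*csaddr, &rseq->rseq_cs) */
		*csaddr = 0;
	}
	return true;
}

static bool set_ids(void)
{
	/* constant NULL makes the branch dead code after inlining */
	return set_ids_get_csaddr(NULL);
}

int main(void)
{
	uint64_t cs;

	return !(set_ids() && set_ids_get_csaddr(&cs));
}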
>
> static __always_inline void rseq_exit_to_user_mode(void)
> {
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -42,7 +42,6 @@
> #include <linux/posix-timers_types.h>
> #include <linux/restart_block.h>
> #include <linux/rseq_types.h>
> -#include <uapi/linux/rseq.h>
> #include <linux/seqlock_types.h>
> #include <linux/kcsan.h>
> #include <linux/rv.h>
> @@ -1402,15 +1401,6 @@ struct task_struct {
> #endif /* CONFIG_NUMA_BALANCING */
>
> struct rseq_data rseq;
> -#ifdef CONFIG_DEBUG_RSEQ
> - /*
> - * This is a place holder to save a copy of the rseq fields for
> - * validation of read-only fields. The struct rseq has a
> - * variable-length array at the end, so it cannot be used
> - * directly. Reserve a size large enough for the known fields.
> - */
> - char rseq_fields[sizeof(struct rseq)];
> -#endif
>
> #ifdef CONFIG_SCHED_MM_CID
> int mm_cid; /* Current cid in mm */
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -88,13 +88,6 @@
> # define RSEQ_EVENT_GUARD preempt
> #endif
>
> -/* The original rseq structure size (including padding) is 32 bytes. */
> -#define ORIG_RSEQ_SIZE 32
> -
> -#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
> - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
> - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
> -
> DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
>
> static inline void rseq_control_debug(bool on)
> @@ -227,159 +220,9 @@ static int __init rseq_debugfs_init(void
> __initcall(rseq_debugfs_init);
> #endif /* CONFIG_DEBUG_FS */
>
> -#ifdef CONFIG_DEBUG_RSEQ
> -static struct rseq *rseq_kernel_fields(struct task_struct *t)
> -{
> - return (struct rseq *) t->rseq_fields;
> -}
> -
> -static int rseq_validate_ro_fields(struct task_struct *t)
> -{
> - static DEFINE_RATELIMIT_STATE(_rs,
> - DEFAULT_RATELIMIT_INTERVAL,
> - DEFAULT_RATELIMIT_BURST);
> - u32 cpu_id_start, cpu_id, node_id, mm_cid;
> - struct rseq __user *rseq = t->rseq.usrptr;
> -
> - /*
> - * Validate fields which are required to be read-only by
> - * user-space.
> - */
> - if (!user_read_access_begin(rseq, t->rseq.len))
> - goto efault;
> - unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
> - unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
> - unsafe_get_user(node_id, &rseq->node_id, efault_end);
> - unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end);
> - user_read_access_end();
> -
> - if ((cpu_id_start != rseq_kernel_fields(t)->cpu_id_start ||
> - cpu_id != rseq_kernel_fields(t)->cpu_id ||
> - node_id != rseq_kernel_fields(t)->node_id ||
> - mm_cid != rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) {
> -
> - pr_warn("Detected rseq corruption for pid: %d, name: %s\n"
> - "\tcpu_id_start: %u ?= %u\n"
> - "\tcpu_id: %u ?= %u\n"
> - "\tnode_id: %u ?= %u\n"
> - "\tmm_cid: %u ?= %u\n",
> - t->pid, t->comm,
> - cpu_id_start, rseq_kernel_fields(t)->cpu_id_start,
> - cpu_id, rseq_kernel_fields(t)->cpu_id,
> - node_id, rseq_kernel_fields(t)->node_id,
> - mm_cid, rseq_kernel_fields(t)->mm_cid);
> - }
> -
> - /* For now, only print a console warning on mismatch. */
> - return 0;
> -
> -efault_end:
> - user_read_access_end();
> -efault:
> - return -EFAULT;
> -}
> -
> -/*
> - * Update an rseq field and its in-kernel copy in lock-step to keep a coherent
> - * state.
> - */
> -#define rseq_unsafe_put_user(t, value, field, error_label) \
> - do { \
> - unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \
> - rseq_kernel_fields(t)->field = value; \
> - } while (0)
> -
> -#else
> -static int rseq_validate_ro_fields(struct task_struct *t)
> -{
> - return 0;
> -}
> -
> -#define rseq_unsafe_put_user(t, value, field, error_label) \
> - unsafe_put_user(value, &t->rseq.usrptr->field, error_label)
> -#endif
> -
> -static int rseq_update_cpu_node_id(struct task_struct *t)
> -{
> - struct rseq __user *rseq = t->rseq.usrptr;
> - u32 cpu_id = raw_smp_processor_id();
> - u32 node_id = cpu_to_node(cpu_id);
> - u32 mm_cid = task_mm_cid(t);
> -
> - rseq_stat_inc(rseq_stats.ids);
> -
> - /* Validate read-only rseq fields on debug kernels */
> - if (rseq_validate_ro_fields(t))
> - goto efault;
> - WARN_ON_ONCE((int) mm_cid < 0);
> -
> - if (!user_write_access_begin(rseq, t->rseq.len))
> - goto efault;
> -
> - rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
> - rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
> - rseq_unsafe_put_user(t, node_id, node_id, efault_end);
> - rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
> -
> - /* Cache the user space values */
> - t->rseq.ids.cpu_id = cpu_id;
> - t->rseq.ids.mm_cid = mm_cid;
> -
> - /*
> - * Additional feature fields added after ORIG_RSEQ_SIZE
> - * need to be conditionally updated only if
> - * t->rseq_len != ORIG_RSEQ_SIZE.
> - */
> - user_write_access_end();
> - trace_rseq_update(t);
> - return 0;
> -
> -efault_end:
> - user_write_access_end();
> -efault:
> - return -EFAULT;
> -}
> -
> -static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
> +static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
> {
> - struct rseq __user *rseq = t->rseq.usrptr;
> - u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
> - mm_cid = 0;
> -
> - /*
> - * Validate read-only rseq fields.
> - */
> - if (rseq_validate_ro_fields(t))
> - goto efault;
> -
> - if (!user_write_access_begin(rseq, t->rseq.len))
> - goto efault;
> -
> - /*
> - * Reset all fields to their initial state.
> - *
> - * All fields have an initial state of 0 except cpu_id which is set to
> - * RSEQ_CPU_ID_UNINITIALIZED, so that any user coming in after
> - * unregistration can figure out that rseq needs to be registered
> - * again.
> - */
> - rseq_unsafe_put_user(t, cpu_id_start, cpu_id_start, efault_end);
> - rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
> - rseq_unsafe_put_user(t, node_id, node_id, efault_end);
> - rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
> -
> - /*
> - * Additional feature fields added after ORIG_RSEQ_SIZE
> - * need to be conditionally reset only if
> - * t->rseq_len != ORIG_RSEQ_SIZE.
> - */
> - user_write_access_end();
> - return 0;
> -
> -efault_end:
> - user_write_access_end();
> -efault:
> - return -EFAULT;
> + return rseq_set_ids_get_csaddr(t, ids, node_id, NULL);
> }
>
> static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
> @@ -407,6 +250,8 @@ static bool rseq_handle_cs(struct task_s
> void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> {
> struct task_struct *t = current;
> + struct rseq_ids ids;
> + u32 node_id;
> bool event;
> int sig;
>
> @@ -453,6 +298,8 @@ void __rseq_handle_notify_resume(struct
> scoped_guard(RSEQ_EVENT_GUARD) {
> event = t->rseq.event.sched_switch;
> t->rseq.event.sched_switch = false;
> + ids.cpu_id = task_cpu(t);
> + ids.mm_cid = task_mm_cid(t);
> }
>
> if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
> @@ -461,7 +308,8 @@ void __rseq_handle_notify_resume(struct
> if (!rseq_handle_cs(t, regs))
> goto error;
>
> - if (unlikely(rseq_update_cpu_node_id(t)))
> + node_id = cpu_to_node(ids.cpu_id);
> + if (!rseq_set_ids(t, &ids, node_id))
> goto error;
> return;
>
> @@ -502,13 +350,33 @@ void rseq_syscall(struct pt_regs *regs)
> }
> #endif
>
> +static bool rseq_reset_ids(void)
> +{
> + struct rseq_ids ids = {
> + .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
> + .mm_cid = 0,
> + };
> +
> + /*
> + * If this fails, terminate it because this leaves the kernel in
> + * stupid state as exit to user space will try to fixup the ids
> + * again.
> + */
> + if (rseq_set_ids(current, &ids, 0))
> + return true;
> +
> + force_sig(SIGSEGV);
> + return false;
> +}
> +
> +/* The original rseq structure size (including padding) is 32 bytes. */
> +#define ORIG_RSEQ_SIZE 32
> +
> /*
> * sys_rseq - setup restartable sequences for caller thread.
> */
> SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
> {
> - int ret;
> -
> if (flags & RSEQ_FLAG_UNREGISTER) {
> if (flags & ~RSEQ_FLAG_UNREGISTER)
> return -EINVAL;
> @@ -519,9 +387,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> return -EINVAL;
> if (current->rseq.sig != sig)
> return -EPERM;
> - ret = rseq_reset_rseq_cpu_node_id(current);
> - if (ret)
> - return ret;
> + if (!rseq_reset_ids())
> + return -EFAULT;
AFAIU, this is only done for the debug checks, stats, and tracing
side-effects, because cpu_cid will be set to ~0ULL by the following
rseq_reset() anyway, am I correct ?
It's kind of weird to do the reset work only to have it immediately
overwritten.
> rseq_reset(current);
> return 0;
> }
> @@ -571,17 +438,6 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> if (put_user_masked_u64(0UL, &rseq->rseq_cs))
> return -EFAULT;
>
> -#ifdef CONFIG_DEBUG_RSEQ
> - /*
> - * Initialize the in-kernel rseq fields copy for validation of
> - * read-only fields.
> - */
> - if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
> - get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
> - get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
> - get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
> - return -EFAULT;
> -#endif
OK, I think I get why you are using cpu_cid != ~0ULL rather than the
rseq user pointer in the debug code. After rseq registration, now that
those get_user() calls are removed, the kernel and userspace fields are
out of sync, so you need a way to detect that on return to userspace
from a sys_rseq registration.
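To double check my understanding, a stand-alone model of the sentinel
protocol as I read it (hypothetical types and names, not kernel code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t cached_cpu_cid = ~0ULL;	/* state after sys_rseq register */

static bool debug_validate(uint64_t user_copy)
{
	if (cached_cpu_cid == ~0ULL)	/* first exit: nothing to compare */
		return true;
	return cached_cpu_cid == user_copy;	/* false => corruption */
}

int main(void)
{
	printf("%d\n", debug_validate(42));	/* 1: check skipped */
	cached_cpu_cid = 42;			/* rseq_set_ids() caches it */
	printf("%d\n", debug_validate(42));	/* 1: in sync */
	printf("%d\n", debug_validate(7));	/* 0: user space messed with it */
	return 0;
}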
Thanks,
Mathieu
> /*
> * Activate the registration by setting the rseq area address, length
> * and signature in the task struct.
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com