Message-ID: <2ce887bd-f0f1-46bd-a56e-7e35d60880dc@efficios.com>
Date: Thu, 11 Sep 2025 11:41:53 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Peter Zijlstra <peterz@...radead.org>,
 "Paul E. McKenney" <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
 Jonathan Corbet <corbet@....net>,
 Prakash Sangappa <prakash.sangappa@...cle.com>,
 Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
 K Prateek Nayak <kprateek.nayak@....com>,
 Steven Rostedt <rostedt@...dmis.org>,
 Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
 Arnd Bergmann <arnd@...db.de>, linux-arch@...r.kernel.org,
 Michael Jeanson <mjeanson@...icios.com>
Subject: Re: [patch 02/12] rseq: Add fields and constants for time slice
 extension

On 2025-09-08 18:59, Thomas Gleixner wrote:
> Aside of a Kconfig knob add the following items:
> 
>     - Two flag bits for the rseq user space ABI, which allow user space to
>       query the availability and enablement without a syscall.
> 
>     - A new member to the user space ABI struct rseq, which is going to be
>       used to communicate request and grant between kernel and user space.
> 
>     - A rseq state struct to hold the kernel state of this
> 
>     - Documentation of the new mechanism
> 
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> Cc: "Paul E. McKenney" <paulmck@...nel.org>
> Cc: Boqun Feng <boqun.feng@...il.com>
> Cc: Jonathan Corbet <corbet@....net>
> Cc: Prakash Sangappa <prakash.sangappa@...cle.com>
> Cc: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
> Cc: K Prateek Nayak <kprateek.nayak@....com>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
> ---
>   Documentation/userspace-api/index.rst |    1
>   Documentation/userspace-api/rseq.rst  |  129 ++++++++++++++++++++++++++++++++++
>   include/linux/rseq_types.h            |   26 ++++++
>   include/uapi/linux/rseq.h             |   28 +++++++
>   init/Kconfig                          |   12 +++
>   kernel/rseq.c                         |    8 ++
>   6 files changed, 204 insertions(+)
> 
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -21,6 +21,7 @@ System calls
>      ebpf/index
>      ioctl/index
>      mseal
> +   rseq
>   
>   Security-related interfaces
>   ===========================
> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,129 @@
> +=====================
> +Restartable Sequences
> +=====================
> +
> +Restartable Sequences allow to register a per thread userspace memory area
> +to be used as an ABI between kernel and user-space for three purposes:
> +
> + * user-space restartable sequences
> +
> + * quick access to read the current CPU number, node ID from user-space

Also reading the "concurrency ID" (mm_cid).

> +
> + * scheduler time slice extensions
> +
> +Restartable sequences (per-cpu atomics)
> +---------------------------------------
> +
> +Restartable sequences allow user-space to perform update operations on
> +per-cpu data without requiring heavy-weight atomic operations. The actual
> +ABI is unfortunately only available in the code and selftests.

Note that I've written a man page, available here:

https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2

which describes the ABI.

> +
> +Quick access to CPU number, node ID
> +-----------------------------------
> +
> +Allows to implement per CPU data efficiently. Documentation is in code and
> +selftests. :(

At what level of detail should we document this here? Would it be OK to show
examples that rely on librseq helpers?

> +
> +Scheduler time slice extensions
> +-------------------------------
> +

Note: I suspect we'll also want to add this section to the rseq(2) man page.

> +This allows a thread to request a time slice extension when it enters a
> +critical section to avoid contention on a resource when the thread is
> +scheduled out inside of the critical section.
> +
> +The prerequisites for this functionality are:
> +
> +    * Enabled in Kconfig
> +
> +    * Enabled at boot time (default is enabled)
> +
> +    * A rseq user space pointer has been registered for the thread
> +
> +The thread has to enable the functionality via prctl(2)::
> +
> +    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> +          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> +
> +prctl() returns 0 on success and otherwise with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL	  Functionality not available or invalid function arguments.
> +          Note: arg4 and arg5 must be zero
> +ENOTSUPP  Functionality was disabled on the kernel command line
> +ENXIO	  Available, but no rseq user struct registered
> +========= ==============================================================
> +
> +The state can be also queried via prctl(2)::
> +
> +  prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
> +
> +prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
> +disabled. Otherwise it returns with the following error codes:
> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL	  Functionality not available or invalid function arguments.
> +          Note: arg3 and arg4 and arg5 must be zero
> +========= ==============================================================
> +
> +The availability and status is also exposed via the rseq ABI struct flags
> +field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
> +``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user
> +space and only for informational purposes.

Do those flags have a meaning within the struct rseq_cs @flags field as
well, or just within the struct rseq flags field?
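
For reference, checking these two bits from user space might look like the
following. This is only a minimal sketch: the bit numbers are copied from the
uapi hunk quoted further down, the helper names slice_ext_available() and
slice_ext_enabled() are made up for illustration, and how a thread locates
the flags word of its registered struct rseq is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers copied from the quoted include/uapi/linux/rseq.h hunk. */
enum {
	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT	= 4,
	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT	= 5,
};

#define RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE \
	(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT)
#define RSEQ_CS_FLAG_SLICE_EXT_ENABLED \
	(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT)

/* The new feature bits must not collide with the historical bits 0..2. */
static_assert(((RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE |
		RSEQ_CS_FLAG_SLICE_EXT_ENABLED) & 0x7U) == 0,
	      "feature bits overlap historical bits");

/* Hypothetical helper: was the kernel built with the feature? */
static inline bool slice_ext_available(uint32_t rseq_flags)
{
	return (rseq_flags & RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE) != 0;
}

/* Hypothetical helper: is the feature enabled for this thread? */
static inline bool slice_ext_enabled(uint32_t rseq_flags)
{
	return (rseq_flags & RSEQ_CS_FLAG_SLICE_EXT_ENABLED) != 0;
}
```

Both helpers take a snapshot of the flags word by value, which matches the
"read only, informational" wording of the quoted documentation: user space
only ever tests the bits and never writes them.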

> +
> +If the mechanism was enabled via prctl(), the thread can request a time
> +slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
> +rseq slice_ctrl field. If the thread is interrupted and the interrupt
> +results in a reschedule request in the kernel, then the kernel can grant a
> +time slice extension and return to user space instead of scheduling
> +out.
> +
> +The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
> +and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
> +field. If there is a reschedule of the thread after granting the extension,
> +the kernel clears the granted bit to indicate that to user space.
> +
> +If the request bit is still set when leaving the critical section, user
> +space can clear it and continue.
> +
> +If the granted bit is set, then user space has to invoke rseq_slice_yield()
> +when leaving the critical section to relinquish the CPU. The kernel
> +enforces this by arming a timer to prevent misbehaving user space from
> +abusing this mechanism.
> +
> +If both the request bit and the granted bit are false when leaving the
> +critical section, then this indicates that a grant was revoked and no
> +further action is required by user space.
> +
> +The required code flow is as follows::
> +
> +    rseq->slice_ctrl = REQUEST;
> +    critical_section();
> +    if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> +        if (rseq->slice_ctrl & GRANTED)
> +                rseq_slice_yield();
> +    }
> +
> +local_test_and_clear_bit() has to be local CPU atomic to prevent the
> +obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
> +without LOCK prefix. On architectures, which do not provide lightweight CPU
> +local atomics this needs to be implemented with regular atomic operations.
> +
> +Setting REQUEST has no atomicity requirements as there is no concurrency
> +vs. the GRANTED bit.
> +
> +Checking the GRANTED has no atomicity requirements as there is obviously a
> +race which cannot be avoided at all::
> +
> +    if (rseq->slice_ctrl & GRANTED)
> +      -> Interrupt results in schedule and grant revocation
> +        rseq_slice_yield();
> +
> +So there is no point in pretending that this might be solved by an atomic
> +operation.

See my cover letter comments about the algorithm above.
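
To make the discussion concrete, here is a small user-space simulation of the
quoted flow. It is only a sketch under stated assumptions: slice_ctrl is a
plain C11 atomic rather than the real rseq field, sim_kernel_grant() and
sim_kernel_revoke() are hypothetical stand-ins for what the documentation
says the kernel does, rseq_slice_yield_stub() merely counts calls, and the
test-and-clear uses a regular atomic fetch-and, which is the fallback the
documentation names for architectures without CPU-local atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Masks as defined in the quoted include/uapi/linux/rseq.h hunk. */
#define RSEQ_SLICE_EXT_REQUEST	(1U << 0)
#define RSEQ_SLICE_EXT_GRANTED	(1U << 1)

/* Simulated stand-in for rseq::slice_ctrl (not the real ABI field). */
_Atomic uint32_t slice_ctrl;
/* Number of times the stubbed yield was invoked. */
unsigned int yield_calls;

/* Hypothetical stub: the real rseq_slice_yield() relinquishes the CPU. */
void rseq_slice_yield_stub(void)
{
	yield_calls++;
}

/* Simulated kernel grant: clear REQUEST, set GRANTED. */
void sim_kernel_grant(void)
{
	uint32_t v = atomic_load(&slice_ctrl);

	if (v & RSEQ_SLICE_EXT_REQUEST)
		atomic_store(&slice_ctrl,
			     (v & ~RSEQ_SLICE_EXT_REQUEST) | RSEQ_SLICE_EXT_GRANTED);
}

/* Simulated reschedule after a grant: clear GRANTED. */
void sim_kernel_revoke(void)
{
	atomic_fetch_and(&slice_ctrl, ~RSEQ_SLICE_EXT_GRANTED);
}

/* User-space side of the quoted "required code flow". */
void run_with_slice_request(void (*critical_section)(void))
{
	uint32_t old;

	assert(critical_section);

	/* Setting REQUEST has no atomicity requirement per the quoted text. */
	atomic_store(&slice_ctrl, RSEQ_SLICE_EXT_REQUEST);
	critical_section();

	/* Test-and-clear REQUEST; fetch-and returns the previous value. */
	old = atomic_fetch_and(&slice_ctrl, ~RSEQ_SLICE_EXT_REQUEST);
	if (!(old & RSEQ_SLICE_EXT_REQUEST)) {
		/* REQUEST was consumed: yield only if the grant still stands. */
		if (atomic_load(&slice_ctrl) & RSEQ_SLICE_EXT_GRANTED)
			rseq_slice_yield_stub();
	}
}

/* Three critical sections, one per outcome described in the text. */
void cs_uninterrupted(void) { }
void cs_granted(void) { sim_kernel_grant(); }
void cs_granted_then_revoked(void) { sim_kernel_grant(); sim_kernel_revoke(); }
```

Calling run_with_slice_request() with each of the three callbacks exercises
the three outcomes the documentation lists (request still set, granted,
granted then revoked); only the middle one reaches the yield stub.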

Thanks,

Mathieu

> +
> +The kernel enforces flag consistency and terminates the thread with SIGSEGV
> +if it detects a violation.
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -71,12 +71,35 @@ struct rseq_ids {
>   };
>   
>   /**
> + * union rseq_slice_state - Status information for rseq time slice extension
> + * @state:	Compound to access the overall state
> + * @enabled:	Time slice extension is enabled for the task
> + * @granted:	Time slice extension was granted to the task
> + */
> +union rseq_slice_state {
> +	u16			state;
> +	struct {
> +		u8		enabled;
> +		u8		granted;
> +	};
> +};
> +
> +/**
> + * struct rseq_slice - Status information for rseq time slice extension
> + * @state:	Time slice extension state
> + */
> +struct rseq_slice {
> +	union rseq_slice_state	state;
> +};
> +
> +/**
>    * struct rseq_data - Storage for all rseq related data
>    * @usrptr:	Pointer to the registered user space RSEQ memory
>    * @len:	Length of the RSEQ region
>    * @sig:	Signature of critial section abort IPs
>    * @event:	Storage for event management
>    * @ids:	Storage for cached CPU ID and MM CID
> + * @slice:	Storage for time slice extension data
>    */
>   struct rseq_data {
>   	struct rseq __user		*usrptr;
> @@ -84,6 +107,9 @@ struct rseq_data {
>   	u32				sig;
>   	struct rseq_event		event;
>   	struct rseq_ids			ids;
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +	struct rseq_slice		slice;
> +#endif
>   };
>   
>   #else /* CONFIG_RSEQ */
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
>   };
>   
>   enum rseq_cs_flags_bit {
> +	/* Historical and unsupported bits */
>   	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
>   	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
>   	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
> +	/* (3) Intentional gap to put new bits into a separate byte */
> +
> +	/* User read only feature flags */
> +	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT	= 4,
> +	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT	= 5,
>   };
>   
>   enum rseq_cs_flags {
> @@ -35,6 +41,22 @@ enum rseq_cs_flags {
>   		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
>   	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
>   		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
> +
> +	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE	=
> +		(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
> +	RSEQ_CS_FLAG_SLICE_EXT_ENABLED		=
> +		(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
> +};
> +
> +enum rseq_slice_bits {
> +	/* Time slice extension ABI bits */
> +	RSEQ_SLICE_EXT_REQUEST_BIT		= 0,
> +	RSEQ_SLICE_EXT_GRANTED_BIT		= 1,
> +};
> +
> +enum rseq_slice_masks {
> +	RSEQ_SLICE_EXT_REQUEST	= (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
> +	RSEQ_SLICE_EXT_GRANTED	= (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
>   };
>   
>   /*
> @@ -142,6 +164,12 @@ struct rseq {
>   	__u32 mm_cid;
>   
>   	/*
> +	 * Time slice extension control word. CPU local atomic updates from
> +	 * kernel and user space.
> +	 */
> +	__u32 slice_ctrl;
> +
> +	/*
>   	 * Flexible array member at end of structure, after last feature field.
>   	 */
>   	char end[];
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>   
>   	  If unsure, say N.
>   
> +config RSEQ_SLICE_EXTENSION
> +	bool "Enable rseq based time slice extension mechanism"
> +	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
> +	help
> +          Allows userspace to request a limited time slice extension when
> +	  returning from an interrupt to user space via the RSEQ shared
> +	  data ABI. If granted, that allows to complete a critical section,
> +	  so that other threads are not stuck on a conflicted resource,
> +	  while the task is scheduled out.
> +
> +	  If unsure, say N.
> +
>   config DEBUG_RSEQ
>   	default n
>   	bool "Enable debugging of rseq() system call" if EXPERT
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
>    */
>   SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
>   {
> +	u32 rseqfl = 0;
> +
>   	if (flags & RSEQ_FLAG_UNREGISTER) {
>   		if (flags & ~RSEQ_FLAG_UNREGISTER)
>   			return -EINVAL;
> @@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   	if (put_user_masked_u64(0UL, &rseq->rseq_cs))
>   		return -EFAULT;
>   
> +	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
> +		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> +
> +	if (put_user_masked_u32(rseqfl, &rseq->flags))
> +		return -EFAULT;
> +
>   	/*
>   	 * Activate the registration by setting the rseq area address, length
>   	 * and signature in the task struct.
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
