Message-ID: <159c984d-37fc-4b63-acf3-d0409c9b57cd@efficios.com>
Date: Thu, 11 Sep 2025 11:27:04 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Peter Zijlstra <peterz@...radead.org>,
"Paul E. McKenney" <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
Jonathan Corbet <corbet@....net>,
Prakash Sangappa <prakash.sangappa@...cle.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Arnd Bergmann <arnd@...db.de>, linux-arch@...r.kernel.org
Subject: Re: [patch 00/12] rseq: Implement time slice extension mechanism
On 2025-09-08 18:59, Thomas Gleixner wrote:
> This is the proper implementation of the PoC code, which I posted in reply
> to the latest iteration of Prakash's time slice extension patches:
>
> https://lore.kernel.org/all/87o6smb3a0.ffs@tglx
>
> Time slice extensions are an attempt to provide opportunistic priority
> ceiling without the overhead of an actual priority ceiling protocol, but
> also without the guarantees such a protocol provides.
>
> The intent is to avoid situations where a user space thread is interrupted
> in a critical section and scheduled out, while holding a resource on which
> the preempting thread or other threads in the system might block on. That
> obviously prevents those threads from making progress in the worst case for
> at least a full time slice. Especially in the context of user space
> spinlocks, which are a patently bad idea to begin with, but that's also
> true for other mechanisms.
>
> This has been attempted to solve at least for a decade, but so far this
> went nowhere. The recent attempts, which started to integrate with the
> already existing RSEQ mechanism, have been at least going into the right
> direction. The full history is partially in the above mentioned mail thread
> and it's ancestors, but also in various threads in the LKML archives, which
it's -> its
> require archaeological efforts to retrieve.
>
> When trying to morph the PoC into actual mergeable code, I stumbled over
> various shortcomings in the RSEQ code, which have been addressed in a
> separate effort. The latest iteration can be found here:
>
> https://lore.kernel.org/all/20250908212737.353775467@linutronix.de
>
> That is a prerequisite for this series as it allows a tight integration
> into the RSEQ code without inflicting a lot of extra overhead into the hot
> paths.
>
> The main change vs. the PoC and the previous attempts is that it utilizes a
> new field in the user space ABI rseq struct, which allows to reduce the
> atomic operations in user space to a bare minimum. If the architecture
> supports CPU local atomics, which protect against the obvious RMW race
> vs. an interrupt, then there is no actual overhead, e.g. LOCK prefix on
> x86, required.
Good!
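As an illustration of the "no LOCK prefix" point: on x86 a single
non-LOCK btr is enough for the CPU-local RMW, since an interrupt cannot
land in the middle of one instruction and only this task's kernel exit
path on the same CPU ever updates the word. A minimal sketch
(illustrative helper, not the actual selftest code):

	#include <stdbool.h>

	/*
	 * CPU-local test-and-clear: atomic vs. interrupts on this CPU,
	 * no LOCK prefix needed because no other CPU writes the word.
	 */
	static inline bool cpu_local_test_and_clear_bit(unsigned int bit,
							unsigned int *word)
	{
		bool old;

		asm volatile("btrl %2, %1\n\t"
			     "setc %0"
			     : "=r" (old), "+m" (*word)
			     : "Ir" (bit)
			     : "cc", "memory");
		return old;
	}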
>
> The kernel user space ABI consists only of two bits in this new field:
>
> REQUEST and GRANTED
>
> User space sets REQUEST at the begin of the critical section. If it
beginning
> finishes the critical section without interruption then it can clear the
> bit and move on.
>
> If it is interrupted and the interrupt return path in the kernel observes a
> rescheduling request, then the kernel can grant a time slice extension. The
> kernel clears the REQUEST bit and sets the GRANTED bit with a simple
> non-atomic store operation. If it does not grant the extension only the
> REQUEST bit is cleared.
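Just to spell out my understanding of the kernel side described here, as
a self-contained model (the names are made up, this is not the patch
code):

	#include <stdbool.h>

	#define SLICE_REQUEST	0x1
	#define SLICE_GRANTED	0x2

	struct slice_state {
		unsigned int	slice_ctrl;	/* stand-in for the rseq uapi field */
		bool		grant_allowed;	/* policy: may this task be extended? */
	};

	/* Returns true if the reschedule is deferred and the grant handed out. */
	static bool slice_grant_on_irq_exit(struct slice_state *s, bool need_resched)
	{
		if (!need_resched || !(s->slice_ctrl & SLICE_REQUEST))
			return false;

		if (s->grant_allowed) {
			/* Clear REQUEST and set GRANTED in one plain store */
			s->slice_ctrl = SLICE_GRANTED;
			return true;
		}

		/* No grant: only REQUEST is cleared */
		s->slice_ctrl &= ~SLICE_REQUEST;
		return false;
	}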
>
> If user space observes the REQUEST bit cleared, when it finished the
> critical section, then it has to check the GRANTED bit. If that is set,
> then it has to invoke the rseq_slice_yield() syscall to terminate the
Does it "have" to ? What is the consequence of misbehaving ?
> extension and yield the CPU.
>
> The code flow in user space is:
>
> 	// Simple store as there is no concurrency vs. the GRANTED bit
> 	rseq->slice_ctrl = REQUEST;
>
> 	critical_section();
>
> 	// CPU local atomic required here:
> 	if (!test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> 		// Non-atomic check is sufficient as this can race
> 		// against an interrupt, which revokes the grant
> 		//
> 		// If not set, then the request was either cleared by the kernel
> 		// without grant or the grant was revoked.
> 		//
> 		// If set, tell the kernel that the critical section is done
> 		// so it can reschedule
> 		if (rseq->slice_ctrl & GRANTED)
> 			rseq_slice_yield();
I wonder if we could achieve this without the cpu-local atomic, and
just rely on simple relaxed-atomic or volatile loads/stores and compiler
barriers in userspace. Let's say we have:
	union {
		u16	slice_ctrl;
		struct {
			u8	slice_request;
			u8	slice_grant;
		};
	};
With userspace doing:
	rseq->slice_request = true;	/* WRITE_ONCE() */
	barrier();
	critical_section();
	barrier();
	rseq->slice_request = false;	/* WRITE_ONCE() */
	if (rseq->slice_grant)		/* READ_ONCE() */
		rseq_slice_yield();
In the kernel interrupt return path, if the kernel observes
"rseq->slice_request" set and "rseq->slice_grant" cleared,
it grants the extension and sets "rseq->slice_grant".
rseq_slice_yield() clears rseq->slice_grant.
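Spelled out as a compilable sketch (field and helper names are mine, and
the new syscall is just stubbed):

	#include <stdbool.h>
	#include <stdint.h>

	#define READ_ONCE(x)		(*(volatile __typeof__(x) *)&(x))
	#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))
	#define barrier()		__asm__ __volatile__("" ::: "memory")

	struct rseq_slice {
		union {
			uint16_t	slice_ctrl;
			struct {
				uint8_t	slice_request;	/* written by user space */
				uint8_t	slice_grant;	/* written by the kernel  */
			};
		};
	};

	static void critical_section(void) { /* user code */ }

	/* Stub for the proposed rseq_slice_yield() syscall */
	static void rseq_slice_yield(void) { }

	static void slice_protected_section(struct rseq_slice *rs)
	{
		WRITE_ONCE(rs->slice_request, true);
		barrier();
		critical_section();
		barrier();
		WRITE_ONCE(rs->slice_request, false);
		/* The kernel may have granted an extension in the meantime */
		if (READ_ONCE(rs->slice_grant))
			rseq_slice_yield();
	}

Because the two flags live in separate bytes, neither side ever has to
do a RMW on a byte the other side writes, which is what would remove the
need for the CPU-local atomic in the fast path.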
> 	}
>
> The other details, which differ from earlier attempts and the PoC, are:
>
> - A separate syscall for terminating the extension to avoid side
> effects and overloading of the already ill defined sched_yield(2)
>
> - A separate per CPU timer, which again does not inflict side effects
> on the scheduler internal hrtick timer. The hrtick timer can be
> disabled at run-time and an expiry can cause interesting problems in
> the scheduler code when it is unexpectedly invoked.
>
> - Tight integration into the rseq exit to user mode code. It utilizes
> the path when TIF_RESQ is not set at the end of exit_to_user_mode()
TIF_RSEQ
> to arm the timer if an extension was granted. TIF_RSEQ indicates that
> the task was scheduled and therefore would revoke the grant anyway.
>
> - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
> model which is utilized by PREEMPT_RT.
Can you clarify why this attempt is "futile"?
Thanks,
Mathieu
>
> It allows the extension to be granted when TIF_PREEMPT_LAZY is set,
> but not TIF_PREEMPT.
>
> Pretending that this can be made work for TIF_PREEMPT on a fully
> preemptible kernel is just wishful thinking as the chance that
> TIF_PREEMPT is set in exit_to_user_mode() is close to zero for
> obvious reasons.
>
> This only "works" by some definition of works, i.e. on a best effort
> basis, for the PREEMPT_NONE model and nothing else. Though given the
> problems PREEMPT_NONE and also PREEMPT_VOLUNTARY have vs. long
> running code sections, the days of these models should be hopefully
> numbered and everything consolidated on the LAZY model.
>
> That makes this distinction moot and everything restricted to
> TIF_PREEMPT_LAZY unless someone is crazy enough to inflict the slice
> extension mechanism into the scheduler hotpath. I'm sure there will
> be attempts to do that as there is no lack of crazy folks out
> there...
>
> - Actual documentation of the user space ABI and a initial self test.
>
> The RSEQ modifications on which this series is based can be found here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
>
> For your convenience all of it is also available as a conglomerate from
> git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>
> Thanks,
>
> tglx
> ---
> Documentation/userspace-api/index.rst | 1
> Documentation/userspace-api/rseq.rst | 129 ++++++++++++
> arch/alpha/kernel/syscalls/syscall.tbl | 1
> arch/arm/tools/syscall.tbl | 1
> arch/arm64/tools/syscall_32.tbl | 1
> arch/m68k/kernel/syscalls/syscall.tbl | 1
> arch/microblaze/kernel/syscalls/syscall.tbl | 1
> arch/mips/kernel/syscalls/syscall_n32.tbl | 1
> arch/mips/kernel/syscalls/syscall_n64.tbl | 1
> arch/mips/kernel/syscalls/syscall_o32.tbl | 1
> arch/parisc/kernel/syscalls/syscall.tbl | 1
> arch/powerpc/kernel/syscalls/syscall.tbl | 1
> arch/s390/kernel/syscalls/syscall.tbl | 1
> arch/s390/mm/pfault.c | 3
> arch/sh/kernel/syscalls/syscall.tbl | 1
> arch/sparc/kernel/syscalls/syscall.tbl | 1
> arch/x86/entry/syscalls/syscall_32.tbl | 1
> arch/x86/entry/syscalls/syscall_64.tbl | 1
> arch/xtensa/kernel/syscalls/syscall.tbl | 1
> include/linux/entry-common.h | 2
> include/linux/rseq.h | 11 +
> include/linux/rseq_entry.h | 176 ++++++++++++++++
> include/linux/rseq_types.h | 28 ++
> include/linux/sched.h | 7
> include/linux/syscalls.h | 1
> include/linux/thread_info.h | 16 -
> include/uapi/asm-generic/unistd.h | 5
> include/uapi/linux/prctl.h | 10
> include/uapi/linux/rseq.h | 28 ++
> init/Kconfig | 12 +
> kernel/entry/common.c | 14 +
> kernel/entry/syscall-common.c | 11 -
> kernel/rcu/tiny.c | 8
> kernel/rcu/tree.c | 14 -
> kernel/rcu/tree_exp.h | 3
> kernel/rcu/tree_plugin.h | 9
> kernel/rcu/tree_stall.h | 3
> kernel/rseq.c | 293 ++++++++++++++++++++++++++++
> kernel/sys.c | 6
> kernel/sys_ni.c | 1
> scripts/syscall.tbl | 1
> tools/testing/selftests/rseq/.gitignore | 1
> tools/testing/selftests/rseq/Makefile | 5
> tools/testing/selftests/rseq/rseq-abi.h | 2
> tools/testing/selftests/rseq/slice_test.c | 217 ++++++++++++++++++++
> 45 files changed, 991 insertions(+), 42 deletions(-)
>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com