Message-ID: <159c984d-37fc-4b63-acf3-d0409c9b57cd@efficios.com>
Date: Thu, 11 Sep 2025 11:27:04 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Peter Zijlstra <peterz@...radead.org>,
"Paul E. McKenney" <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
Jonathan Corbet <corbet@....net>,
Prakash Sangappa <prakash.sangappa@...cle.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Arnd Bergmann <arnd@...db.de>, linux-arch@...r.kernel.org
Subject: Re: [patch 00/12] rseq: Implement time slice extension mechanism
On 2025-09-08 18:59, Thomas Gleixner wrote:
> This is the proper implementation of the PoC code, which I posted in reply
> to the latest iteration of Prakash's time slice extension patches:
>
> https://lore.kernel.org/all/87o6smb3a0.ffs@tglx
>
> Time slice extensions are an attempt to provide opportunistic priority
> ceiling without the overhead of an actual priority ceiling protocol, but
> also without the guarantees such a protocol provides.
>
> The intent is to avoid situations where a user space thread is interrupted
> in a critical section and scheduled out, while holding a resource on which
> the preempting thread or other threads in the system might block on. That
> obviously prevents those threads from making progress in the worst case for
> at least a full time slice. Especially in the context of user space
> spinlocks, which are a patently bad idea to begin with, but that's also
> true for other mechanisms.
>
> This has been attempted to solve at least for a decade, but so far this
> went nowhere. The recent attempts, which started to integrate with the
> already existing RSEQ mechanism, have been at least going into the right
> direction. The full history is partially in the above mentioned mail thread
> and it's ancestors, but also in various threads in the LKML archives, which
it's -> its
> require archaeological efforts to retrieve.
>
> When trying to morph the PoC into actual mergeable code, I stumbled over
> various shortcomings in the RSEQ code, which have been addressed in a
> separate effort. The latest iteration can be found here:
>
> https://lore.kernel.org/all/20250908212737.353775467@linutronix.de
>
> That is a prerequisite for this series as it allows a tight integration
> into the RSEQ code without inflicting a lot of extra overhead into the hot
> paths.
>
> The main change vs. the PoC and the previous attempts is that it utilizes a
> new field in the user space ABI rseq struct, which allows to reduce the
> atomic operations in user space to a bare minimum. If the architecture
> supports CPU local atomics, which protect against the obvious RMW race
> vs. an interrupt, then there is no actual overhead, e.g. LOCK prefix on
> x86, required.
Good!
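As an illustration of the "no LOCK prefix" point: on x86 a single
non-LOCK btr is enough for the CPU-local RMW, since an interrupt cannot
land in the middle of one instruction and only this task's kernel exit
path on the same CPU ever updates the word. A minimal sketch
(illustrative helper, not the actual selftest code):

	#include <stdbool.h>

	/*
	 * CPU-local test-and-clear: atomic vs. interrupts on this CPU,
	 * no LOCK prefix needed because no other CPU writes the word.
	 */
	static inline bool cpu_local_test_and_clear_bit(unsigned int bit,
							unsigned int *word)
	{
		bool old;

		asm volatile("btrl %2, %1\n\t"
			     "setc %0"
			     : "=r" (old), "+m" (*word)
			     : "Ir" (bit)
			     : "cc", "memory");
		return old;
	}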
>
> The kernel user space ABI consists only of two bits in this new field:
>
> REQUEST and GRANTED
>
> User space sets REQUEST at the begin of the critical section. If it
beginning
> finishes the critical section without interruption then it can clear the
> bit and move on.
>
> If it is interrupted and the interrupt return path in the kernel observes a
> rescheduling request, then the kernel can grant a time slice extension. The
> kernel clears the REQUEST bit and sets the GRANTED bit with a simple
> non-atomic store operation. If it does not grant the extension only the
> REQUEST bit is cleared.
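Just to spell out my understanding of the kernel side described here, as
a self-contained model (the names are made up, this is not the patch
code):

	#include <stdbool.h>

	#define SLICE_REQUEST	0x1
	#define SLICE_GRANTED	0x2

	struct slice_state {
		unsigned int	slice_ctrl;	/* stand-in for the rseq uapi field */
		bool		grant_allowed;	/* policy: may this task be extended? */
	};

	/* Returns true if the reschedule is deferred and the grant handed out. */
	static bool slice_grant_on_irq_exit(struct slice_state *s, bool need_resched)
	{
		if (!need_resched || !(s->slice_ctrl & SLICE_REQUEST))
			return false;

		if (s->grant_allowed) {
			/* Clear REQUEST and set GRANTED in one plain store */
			s->slice_ctrl = SLICE_GRANTED;
			return true;
		}

		/* No grant: only REQUEST is cleared */
		s->slice_ctrl &= ~SLICE_REQUEST;
		return false;
	}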
>
> If user space observes the REQUEST bit cleared, when it finished the
> critical section, then it has to check the GRANTED bit. If that is set,
> then it has to invoke the rseq_slice_yield() syscall to terminate the
Does it "have" to ? What is the consequence of misbehaving ?
> extension and yield the CPU.
>
> The code flow in user space is:
>
> 	// Simple store as there is no concurrency vs. the GRANTED bit
> 	rseq->slice_ctrl = REQUEST;
>
> 	critical_section();
>
> 	// CPU local atomic required here:
> 	if (!test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> 		// Non-atomic check is sufficient as this can race
> 		// against an interrupt, which revokes the grant
> 		//
> 		// If not set, then the request was either cleared by the kernel
> 		// without grant or the grant was revoked.
> 		//
> 		// If set, tell the kernel that the critical section is done
> 		// so it can reschedule
> 		if (rseq->slice_ctrl & GRANTED)
> 			rseq_slice_yield();
I wonder if we could achieve this without the cpu-local atomic, and
just rely on simple relaxed-atomic or volatile loads/stores and compiler
barriers in userspace. Let's say we have:
	union {
		u16	slice_ctrl;
		struct {
			u8	slice_request;
			u8	slice_grant;
		};
	};
With userspace doing:
	rseq->slice_request = true;	/* WRITE_ONCE() */
	barrier();
	critical_section();
	barrier();
	rseq->slice_request = false;	/* WRITE_ONCE() */
	if (rseq->slice_grant)		/* READ_ONCE() */
		rseq_slice_yield();
In the kernel interrupt return path, if the kernel observes
"rseq->slice_request" set and "rseq->slice_grant" cleared,
it grants the extension and sets "rseq->slice_grant".
rseq_slice_yield() clears rseq->slice_grant.
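Spelled out as a compilable sketch (field and helper names are mine, and
the new syscall is just stubbed):

	#include <stdbool.h>
	#include <stdint.h>

	#define READ_ONCE(x)		(*(volatile __typeof__(x) *)&(x))
	#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))
	#define barrier()		__asm__ __volatile__("" ::: "memory")

	struct rseq_slice {
		union {
			uint16_t	slice_ctrl;
			struct {
				uint8_t	slice_request;	/* written by user space */
				uint8_t	slice_grant;	/* written by the kernel  */
			};
		};
	};

	static void critical_section(void) { /* user code */ }

	/* Stub for the proposed rseq_slice_yield() syscall */
	static void rseq_slice_yield(void) { }

	static void slice_protected_section(struct rseq_slice *rs)
	{
		WRITE_ONCE(rs->slice_request, true);
		barrier();
		critical_section();
		barrier();
		WRITE_ONCE(rs->slice_request, false);
		/* The kernel may have granted an extension in the meantime */
		if (READ_ONCE(rs->slice_grant))
			rseq_slice_yield();
	}

Because the two flags live in separate bytes, neither side ever has to
do a RMW on a byte the other side writes, which is what would remove the
need for the CPU-local atomic in the fast path.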
> 	}
>
> The other details, which differ from earlier attempts and the PoC, are:
>
> - A separate syscall for terminating the extension to avoid side
> effects and overloading of the already ill defined sched_yield(2)
>
> - A separate per CPU timer, which again does not inflict side effects
> on the scheduler internal hrtick timer. The hrtick timer can be
> disabled at run-time and an expiry can cause interesting problems in
> the scheduler code when it is unexpectedly invoked.
>
> - Tight integration into the rseq exit to user mode code. It utilizes
> the path when TIF_RESQ is not set at the end of exit_to_user_mode()
TIF_RSEQ
> to arm the timer if an extension was granted. TIF_RSEQ indicates that
> the task was scheduled and therefore would revoke the grant anyway.
>
> - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
> model which is utilized by PREEMPT_RT.
Can you clarify why this attempt is "futile"?
Thanks,
Mathieu
>
> It allows the extension to be granted when TIF_PREEMPT_LAZY is set,
> but not TIF_PREEMPT.
>
> Pretending that this can be made work for TIF_PREEMPT on a fully
> preemptible kernel is just wishful thinking as the chance that
> TIF_PREEMPT is set in exit_to_user_mode() is close to zero for
> obvious reasons.
>
> This only "works" by some definition of works, i.e. on a best effort
> basis, for the PREEMPT_NONE model and nothing else. Though given the
> problems PREEMPT_NONE and also PREEMPT_VOLUNTARY have vs. long
> running code sections, the days of these models should be hopefully
> numbered and everything consolidated on the LAZY model.
>
> That makes this distinction moot and everything restricted to
> TIF_PREEMPT_LAZY unless someone is crazy enough to inflict the slice
> extension mechanism into the scheduler hotpath. I'm sure there will
> be attempts to do that as there is no lack of crazy folks out
> there...
>
> - Actual documentation of the user space ABI and a initial self test.
>
> The RSEQ modifications on which this series is based can be found here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
>
> For your convenience all of it is also available as a conglomerate from
> git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>
> Thanks,
>
> tglx
> ---
> Documentation/userspace-api/index.rst | 1
> Documentation/userspace-api/rseq.rst | 129 ++++++++++++
> arch/alpha/kernel/syscalls/syscall.tbl | 1
> arch/arm/tools/syscall.tbl | 1
> arch/arm64/tools/syscall_32.tbl | 1
> arch/m68k/kernel/syscalls/syscall.tbl | 1
> arch/microblaze/kernel/syscalls/syscall.tbl | 1
> arch/mips/kernel/syscalls/syscall_n32.tbl | 1
> arch/mips/kernel/syscalls/syscall_n64.tbl | 1
> arch/mips/kernel/syscalls/syscall_o32.tbl | 1
> arch/parisc/kernel/syscalls/syscall.tbl | 1
> arch/powerpc/kernel/syscalls/syscall.tbl | 1
> arch/s390/kernel/syscalls/syscall.tbl | 1
> arch/s390/mm/pfault.c | 3
> arch/sh/kernel/syscalls/syscall.tbl | 1
> arch/sparc/kernel/syscalls/syscall.tbl | 1
> arch/x86/entry/syscalls/syscall_32.tbl | 1
> arch/x86/entry/syscalls/syscall_64.tbl | 1
> arch/xtensa/kernel/syscalls/syscall.tbl | 1
> include/linux/entry-common.h | 2
> include/linux/rseq.h | 11 +
> include/linux/rseq_entry.h | 176 ++++++++++++++++
> include/linux/rseq_types.h | 28 ++
> include/linux/sched.h | 7
> include/linux/syscalls.h | 1
> include/linux/thread_info.h | 16 -
> include/uapi/asm-generic/unistd.h | 5
> include/uapi/linux/prctl.h | 10
> include/uapi/linux/rseq.h | 28 ++
> init/Kconfig | 12 +
> kernel/entry/common.c | 14 +
> kernel/entry/syscall-common.c | 11 -
> kernel/rcu/tiny.c | 8
> kernel/rcu/tree.c | 14 -
> kernel/rcu/tree_exp.h | 3
> kernel/rcu/tree_plugin.h | 9
> kernel/rcu/tree_stall.h | 3
> kernel/rseq.c | 293 ++++++++++++++++++++++++++++
> kernel/sys.c | 6
> kernel/sys_ni.c | 1
> scripts/syscall.tbl | 1
> tools/testing/selftests/rseq/.gitignore | 1
> tools/testing/selftests/rseq/Makefile | 5
> tools/testing/selftests/rseq/rseq-abi.h | 2
> tools/testing/selftests/rseq/slice_test.c | 217 ++++++++++++++++++++
> 45 files changed, 991 insertions(+), 42 deletions(-)
>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com