Message-ID: <a65dfd2c-b435-4d83-89d0-abc8002db7c7@efficios.com>
Date: Fri, 12 Sep 2025 15:26:22 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Peter Zijlstra <peterz@...radead.org>,
"Paul E. McKenney" <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
Jonathan Corbet <corbet@....net>,
Prakash Sangappa <prakash.sangappa@...cle.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Arnd Bergmann <arnd@...db.de>, linux-arch@...r.kernel.org,
Florian Weimer <fweimer@...hat.com>, "carlos@...hat.com"
<carlos@...hat.com>, libc-coord@...ts.openwall.com
Subject: Re: [patch 00/12] rseq: Implement time slice extension mechanism
[ For those just CC'd on this thread, the discussion is about time slice
extension for userspace critical sections. We are specifically
discussing the kernel ABI we plan to expose to userspace. ]
On 2025-09-12 12:31, Thomas Gleixner wrote:
> On Fri, Sep 12 2025 at 08:33, Mathieu Desnoyers wrote:
>> On 2025-09-11 16:18, Thomas Gleixner wrote:
>>> It receives SIGSEGV because that means that it did not follow the rules
>>> and stuck an arbitrary syscall into the critical section.
>>
>> Not following the rules could also be done by just looping for a long
>> time in userspace within or after the critical section, in which case
>> the timer should catch it.
>
> It's pretty much impossible to tell for the kernel without more
> overhead, whether that's actually a violation of the rules or not.
>
> The operation after the grant can be interrupted (without resulting in
> scheduling), which is out of control of the task which got the extension
> granted.
>
> The timer is there to ensure that there is an upper bound to the grant
> independent of the actual reason.
If the worst side-effect of this feature is that the slice extension
is not granted when users misbehave, IMHO this would increase the
likelihood of adoption compared to failure modes that kill the
offending process.
>
> Going through a different syscall is an obvious deviation from the rule.
AFAIU, the grant is cleared when a signal handler is delivered, which
makes it OK for signal handlers to issue system calls even when they
nest on top of a granted extension critical section.
>
> As far I understood the earlier discussions, scheduler folks want to
> enforce that because of PREEMPT_NONE semantics, where a randomly chosen
> syscall might not result in an immediate reschedule because the work,
> which needs to be done takes arbitrary time to complete.
>
> Though that's arguably not much different from
>
> syscall()
> -> tick -> NEED_RESCHED
> do_tons_of_work();
> exit_to_user()
> schedule();
>
> except that in the slice extension case, the latency increases by the
> slice extension time.
>
> If we allow arbitrary syscalls to terminate the grant, then we need to
> stick an immediate schedule() into the syscall entry work function. We'd
> still need the separate yield() syscall to provide a side effect free
> way of termination.
>
> I have no strong opinions either way. Peter?
If it happens to not be too bothersome to allow arbitrary system calls
to act as implicit rseq_slice_yield() rather than result in a
segmentation fault, I think it would make this feature more widely
adopted.
Another scenario I have in mind is a userspace critical section that
would typically benefit from slice extension, but seldom needs to
issue a system call. In C and higher level languages, that can be
entirely outside of the user's control, such as accessing a
global-dynamic TLS variable located within a global-dynamic shared
object, which can trigger memory allocation under the hood on first
access.
Handling a syscall within a granted extension by killing the process
would likely relegate this feature to niche use-cases.
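To make the alternative concrete, here is a toy model in plain C of
what treating an arbitrary syscall as an implicit rseq_slice_yield()
could look like at syscall entry. The struct and function names are
made up for illustration; this is not the proposed ABI nor actual
kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task slice state, for illustration only. */
struct task_slice {
	bool granted;		/* kernel granted an extension */
	bool need_resched;	/* deferred preemption pending */
};

/*
 * Sketch of syscall-entry handling: instead of killing the task,
 * an arbitrary syscall simply terminates the grant, and schedules
 * immediately if preemption was deferred by the grant.
 */
static bool syscall_entry_slice(struct task_slice *t)
{
	if (!t->granted)
		return false;		/* no grant: nothing to do */
	t->granted = false;		/* clear the grant */
	if (t->need_resched) {
		t->need_resched = false;
		return true;		/* would call schedule() here */
	}
	return false;
}
```

The point being that the misbehaving (or merely unlucky) task loses
its extension rather than its life.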
>
>>>> rseq->slice_request = true; /* WRITE_ONCE() */
>>>> barrier();
>>>> critical_section();
>>>> barrier();
>>>> rseq->slice_request = false; /* WRITE_ONCE() */
>>>> if (rseq->slice_grant) /* READ_ONCE() */
>>>> rseq_slice_yield();
>>>
>>> That should work as it's strictly CPU local. Good point, now that you
>>> said it it's obvious :)
>>>
>>> Let me rework it accordingly.
>>
>> I have two questions wrt ABI here:
>>
>> 1) Do we expect the slice requests to be done from C and higher level
>> languages or only from assembly ?
>
> It doesn't matter as long as the ordering is guaranteed.
OK, so I understand that you intend to target higher level languages
as well, which makes my second point (nesting) relevant.
>
>> 2) Slice requests are a good fit for locking. Locking typically
>> has nesting ability.
>>
>> We should consider making the slice request ABI a 8-bit
>> or 16-bit nesting counter to allow nesting of its users.
>
> Making request a counter requires to keep request set when the
> extension is granted. So the states would be:
>
> request granted
> 0 0 Neutral
> >0 0 Requested
> >=0 1 Granted
Yes.
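For concreteness, a toy model of the grant decision with that counter
(field and function names are made up for illustration, not the
actual ABI):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative control word: 16-bit nesting counter + grant flag. */
struct slice_ctrl {
	uint16_t request;	/* >0 means requested, nestable */
	uint16_t granted;	/* set by the kernel on grant */
};

/*
 * Sketch of the decision on preemption: a grant is handed out
 * whenever the request counter is non-zero, and the counter itself
 * is left untouched so nested sections keep working.
 */
static bool slice_try_grant(struct slice_ctrl *c)
{
	if (c->request == 0)
		return false;	/* Neutral: nothing requested */
	c->granted = 1;		/* Requested -> Granted */
	return true;
}
```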
>
> That should work.
>
> Though I'm not really convinced that unconditionally embedding it into
> random locking primitives is the right thing to do.
Me neither. I wonder what would be a good approach to integrate this
with locking APIs. Here are a few ideas, some worse than others:
- Extend pthread_mutexattr_t to set whether the mutex should be
slice-extended. Downside: if a mutex has some long and some
short critical sections, it's really a one-size-fits-all decision
for all critical sections for that mutex.
- Extend the pthread_mutex_lock/trylock with new APIs to allow
specifying whether slice-extension is needed for the upcoming critical
section.
- Just let the pthread_mutex_lock caller explicitly request the
slice extension *after* grabbing the lock. Downside: this opens
a window of a few instructions where preemption can happen
and slice extension would have been useful. Should we care ?
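As a sketch of that third option, assuming hypothetical
slice_request_inc()/slice_request_dec() helpers around the proposed
request counter (these helpers and the thread-local counter are
illustration only, not an existing API):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical per-thread slice request counter (illustration only). */
static __thread uint16_t slice_request;

static void slice_request_inc(void) { slice_request++; }
static void slice_request_dec(void) { slice_request--; }

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter;

/* The caller opts in only for this critical section, after the lock
 * is acquired, leaving the small preemptible window discussed above. */
static void short_critical_section(void)
{
	pthread_mutex_lock(&lock);
	slice_request_inc();	/* preemption window closes here */
	shared_counter++;	/* short critical section */
	slice_request_dec();
	pthread_mutex_unlock(&lock);
}
```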
>
> The extension makes only sense, when the actual critical section is
> small and likely to complete within the extension time, which is usually
> only true for highly optimized code and not for general usage, where the
> lock held section is arbitrary long and might even result in syscalls
> even if the critical section itself does not have an obvious explicit
> syscall embedded:
>
> lock(a)
> lock(b) <- Contention results in syscall
Nested locking is another scenario where we'd _typically_ want the
slice extension for the outer lock if it is expected to be a short
critical section, yet may sometimes hit the futex syscall while the
extension is granted; in that case the grant should simply be cleared
rather than the process killed.
>
> Same applies for library functions within a critical section.
Yes.
>
> That then immediately conflicts with the yield mechanism rules, because
> the extension could have been granted _before_ the syscall happens, so
> we'd have remove that restriction too.
Yes.
>
> That said, we can make the ABI a counter and split the slice control
> word into two u16. So the decision function would be:
>
> get_usr(ctrl);
> if (!ctrl.request)
> return;
> ....
> ctrl.granted = 1;
> put_usr(ctrl);
>
> Along with documentation why this should only be used nested when you
> know what you are doing.
Yes.
This would turn the end of critical section into a
decrement-and-test-for-zero. It's only when the request counter
decrements back to zero that userspace should handle the granted
flag and yield.
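In C, that exit path would look roughly like this (again with made-up
field names standing in for the eventual ABI):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative control word: 16-bit nesting counter + grant flag. */
struct slice_ctrl {
	uint16_t request;	/* nesting counter */
	uint16_t granted;	/* set by the kernel on grant */
};

/*
 * Decrement-and-test-for-zero on exit: only the outermost critical
 * section is allowed to consume the grant and yield.
 */
static bool slice_section_exit(struct slice_ctrl *c)
{
	if (--c->request != 0)
		return false;		/* still nested: leave grant alone */
	if (c->granted) {
		c->granted = 0;
		return true;		/* caller issues rseq_slice_yield() */
	}
	return false;
}
```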
>
>> 3) Slice requests are also a good fit for rseq critical sections.
>> Of course someone could explicitly increment/decrement the
>> slice request counter before/after the rseq critical sections, but
>> I think we could do better there and integrate this directly within
>> the struct rseq_cs as a new critical section flag. Basically, a
>> critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
>> better name) set within its descriptor flags would behave as if
>> the slice request counter is non-zero when preempted without
>> requiring any extra instruction on the fast path. The only
>> added overhead would be a check of the rseq->slice_grant flag
>> when exiting the critical section to conditionally issue
>> rseq_slice_yield().
>
> Plus checking first whether rseq->slice.request is actually zero,
> i.e. whether the rseq critical section was the outermost one. If not,
> you cannot invoke the yield even if granted is true, right?
Right.
>
> But mixing state spaces is not really a good idea at all. Let's not go
> there.
I agree, let's keep this (3) for later if there is a strong use-case
justifying the complexity.
What is important right now, though, is to figure out the behavior
with respect to an ongoing rseq critical section when a time slice
extension is granted: is the rseq critical section aborted, or does
it keep going on return to userspace?
>
> Also you'd make checking of rseq_cs unconditional, which means extra
> work in the grant decision function as it would then have to do:
>
> if (!usr->slice.ctrl.request) {
> if (!usr->rseq_cs)
> return;
> if (!valid_ptr(usr->rseq_cs))
> goto die;
> if (!within(regs->ip, usr->rseq_cs.start_ip, usr->rseq_cs.offset))
> return;
> if (!(usr->rseq_cs.flags & REQUEST))
> return;
> }
>
> IOW, we'd copy half of the rseq cs handling into that code.
>
> Can we please keep it independent and simple?
Of course.
So in summary, here is my current understanding:
- It would be good to support nested slice-extension requests,
- It would be preferable to allow arbitrary system calls to
cancel an ongoing slice-extension grant rather than kill the
process if we want the slice-extension feature to be useful
outside of niche use-cases.
Thoughts ?
Thanks,
Mathieu
>
> Thanks,
>
> tglx
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com