Message-ID: <87a52zr5sv.ffs@tglx>
Date: Fri, 12 Sep 2025 18:31:28 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, LKML
<linux-kernel@...r.kernel.org>
Cc: Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
<paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>, Jonathan Corbet
<corbet@....net>, Prakash Sangappa <prakash.sangappa@...cle.com>, Madadi
Vineeth Reddy <vineethr@...ux.ibm.com>, K Prateek Nayak
<kprateek.nayak@....com>, Steven Rostedt <rostedt@...dmis.org>, Sebastian
Andrzej Siewior <bigeasy@...utronix.de>, Arnd Bergmann <arnd@...db.de>,
linux-arch@...r.kernel.org
Subject: Re: [patch 00/12] rseq: Implement time slice extension mechanism
On Fri, Sep 12 2025 at 08:33, Mathieu Desnoyers wrote:
> On 2025-09-11 16:18, Thomas Gleixner wrote:
>> It receives SIGSEGV because that means that it did not follow the rules
>> and stuck an arbitrary syscall into the critical section.
>
> Not following the rules could also be done by just looping for a long
> time in userspace within or after the critical section, in which case
> the timer should catch it.
It's pretty much impossible for the kernel to tell, without adding more
overhead, whether that's actually a violation of the rules or not.
The operation after the grant can be interrupted (without resulting in
scheduling), which is outside the control of the task which got the
extension granted.
The timer is there to ensure that there is an upper bound to the grant
independent of the actual reason.
Going through a different syscall is an obvious deviation from the rule.
As far as I understood the earlier discussions, scheduler folks want to
enforce that because of PREEMPT_NONE semantics, where a randomly chosen
syscall might not result in an immediate reschedule because the work
which needs to be done takes an arbitrary time to complete.
Though that's arguably not much different from
    syscall()
        -> tick -> NEED_RESCHED
        do_tons_of_work();
    exit_to_user()
        schedule();
except that in the slice extension case, the latency increases by the
slice extension time.
If we allow arbitrary syscalls to terminate the grant, then we need to
stick an immediate schedule() into the syscall entry work function. We'd
still need the separate yield() syscall to provide a side effect free
way of termination.
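As a rough sketch only (the rseq_slice_granted flag and the hook name
below are placeholders I made up, not something from the series), the
entry work would boil down to:

    /*
     * Sketch: revoke a granted extension when the task enters the
     * kernel through an arbitrary syscall and schedule right away
     * instead of letting the syscall do its work first.
     */
    static void rseq_slice_syscall_entry_work(struct task_struct *curr)
    {
            if (!curr->rseq_slice_granted)
                    return;

            /* Revoke the grant and give the CPU back immediately */
            curr->rseq_slice_granted = false;
            schedule();
    }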
I have no strong opinions either way. Peter?
>>>     rseq->slice_request = true;     /* WRITE_ONCE() */
>>>     barrier();
>>>     critical_section();
>>>     barrier();
>>>     rseq->slice_request = false;    /* WRITE_ONCE() */
>>>     if (rseq->slice_grant)          /* READ_ONCE() */
>>>             rseq_slice_yield();
>>
>> That should work as it's strictly CPU local. Good point, now that you
>> said it it's obvious :)
>>
>> Let me rework it accordingly.
>
> I have two questions wrt ABI here:
>
> 1) Do we expect the slice requests to be done from C and higher level
> languages or only from assembly ?
It doesn't matter as long as the ordering is guaranteed.
> 2) Slice requests are a good fit for locking. Locking typically
> has nesting ability.
>
> We should consider making the slice request ABI a 8-bit
> or 16-bit nesting counter to allow nesting of its users.
Making request a counter requires keeping request set when the
extension is granted. So the states would be:
    request   granted
       0         0       Neutral
      >0         0       Requested
     >=0         1       Granted
That should work.
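On the user space side that would boil down to something like this (a
sketch only; the slice.request/slice.granted field names just follow
the table above and are not a settled ABI):

    static inline void slice_section_enter(void)
    {
            rseq->slice.request++;          /* WRITE_ONCE() */
            barrier();
    }

    static inline void slice_section_exit(void)
    {
            barrier();
            rseq->slice.request--;          /* WRITE_ONCE() */
            /* Only the outermost exit may consume a grant */
            if (!rseq->slice.request && rseq->slice.granted)  /* READ_ONCE() */
                    rseq_slice_yield();
    }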
Though I'm not really convinced that unconditionally embedding it into
random locking primitives is the right thing to do.
The extension only makes sense when the actual critical section is
small and likely to complete within the extension time, which is usually
only true for highly optimized code and not for general usage, where the
lock held section is arbitrarily long and might even result in syscalls
even if the critical section itself does not have an obvious explicit
syscall embedded:
    lock(a)
      lock(b)       <- Contention results in syscall
Same applies for library functions within a critical section.
That then immediately conflicts with the yield mechanism rules, because
the extension could have been granted _before_ the syscall happens, so
we'd have to remove that restriction too.
That said, we can make the ABI a counter and split the slice control
word into two u16. So the decision function would be:
    get_usr(ctrl);
    if (!ctrl.request)
            return;
    ....
    ctrl.granted = 1;
    put_usr(ctrl);
Along with documentation why this should only be used nested when you
know what you are doing.
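For illustration, the split control word could be laid out like this
(just a sketch of the idea, not a settled ABI layout):

    /*
     * Both halves share one 32-bit word, so kernel and user space can
     * fetch and update request and granted with a single access.
     */
    union rseq_slice_ctrl {
            struct {
                    u16     request;        /* nesting counter, owned by user space */
                    u16     granted;        /* set by the kernel on grant */
            };
            u32     word;
    };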
> 3) Slice requests are also a good fit for rseq critical sections.
> Of course someone could explicitly increment/decrement the
> slice request counter before/after the rseq critical sections, but
> I think we could do better there and integrate this directly within
> the struct rseq_cs as a new critical section flag. Basically, a
> critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
> better name) set within its descriptor flags would behave as if
> the slice request counter is non-zero when preempted without
> requiring any extra instruction on the fast path. The only
> added overhead would be a check of the rseq->slice_grant flag
> when exiting the critical section to conditionally issue
> rseq_slice_yield().
Plus checking first whether rseq->slice.request is actually zero,
i.e. whether the rseq critical section was the outermost one. If not,
you cannot invoke the yield even if granted is true, right?
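I.e. the exit of such a flagged section would still need something like
(same made up field names as above):

    /* On exit of a RSEQ_CS_SLICE_REQUEST flagged critical section */
    if (!rseq->slice.request && rseq->slice.granted)  /* READ_ONCE() */
            rseq_slice_yield();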
But mixing state spaces is not really a good idea at all. Let's not go
there.
Also you'd make checking of rseq_cs unconditional, which means extra
work in the grant decision function as it would then have to do:
    if (!usr->slice.ctrl.request) {
            if (!usr->rseq_cs)
                    return;
            if (!valid_ptr(usr->rseq_cs))
                    goto die;
            if (!within(regs->ip, usr->rseq_cs.start_ip, usr->rseq_cs.offset))
                    return;
            if (!(usr->rseq_cs.flags & REQUEST))
                    return;
    }
IOW, we'd copy half of the rseq cs handling into that code.
Can we please keep it independent and simple?
Thanks,
tglx