linux-kernel - Re: [patch V4 28/36] rseq: Switch to fast path processing on exit to user

Message-ID: <9abc6246-2a90-44ea-b9ed-dd2d3c5440dc@efficios.com>
Date: Fri, 12 Sep 2025 11:44:05 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Michael Jeanson <mjeanson@...icios.com>, Jens Axboe <axboe@...nel.dk>,
 Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
 <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
 Paolo Bonzini <pbonzini@...hat.com>, Sean Christopherson
 <seanjc@...gle.com>, Wei Liu <wei.liu@...nel.org>,
 Dexuan Cui <decui@...rosoft.com>, x86@...nel.org,
 Arnd Bergmann <arnd@...db.de>, Heiko Carstens <hca@...ux.ibm.com>,
 Christian Borntraeger <borntraeger@...ux.ibm.com>,
 Sven Schnelle <svens@...ux.ibm.com>, Huacai Chen <chenhuacai@...nel.org>,
 Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>
Subject: Re: [patch V4 28/36] rseq: Switch to fast path processing on exit to
 user

On 2025-09-12 10:22, Thomas Gleixner wrote:
> On Thu, Sep 11 2025 at 16:00, Mathieu Desnoyers wrote:
>> On 2025-09-11 12:47, Thomas Gleixner wrote:
>>> The only cases where this is not true, are when there is no capacity to
>>> do so or on UP or the parent was affine to a single CPU, which is what
>>> the child inherits.
>>
>> Wrong. See above.
> 
> Huch? You just confirmed what I was saying. It's not true for
> overcommit, i.e. there is no capacity to place it on a different CPU
> right away.

I agree, provided that "no capacity to do so" includes the case
where no other CPU is immediately available to run it.
I had read this as a longer-term configuration constraint
rather than a transient condition.

> 
>>> the fault will happen in the fast path first, which means the exit code
>>> has to go another round through the work loop instead of forcing the
>>> fault right away on the first exit in the slowpath, where it can be
>>> actually resolved.
>>
>> I think you misunderstood my improvement suggestion: I do _not_
>> recommend that we pointlessly take the fast-path on fork, but rather
>> that we keep the cached cpu_cid value (from the parent) around to check
>> whether we actually need to store to userspace and take a page fault.
>> Based on this, we can choose whether we use the fast-path or go directly
>> to the slow-path.
> 
> That's not that trivial because you need an extra decision in the
> scheduler to take the slowpath based on information whether that's the
> first schedule and change after fork so the scheduler can set either
> TIF_RSEQ or NOTIFY along with the slowpath flag.
> 
> But that's just not worth it. So forcing the slowpath after fork seemed
> to be a sensible thing to do.

I agree that it may not be worth the complexity.
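
For reference, the decision discussed above can be sketched in plain C
as follows. This is only an illustration under stated assumptions, not
kernel code: struct cached_ids, needs_user_store() and
pick_path_after_fork() are hypothetical names, and the real cached
cpu_cid layout in the series may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the cached CPU/CID pair; not the kernel layout. */
struct cached_ids {
	uint32_t cpu_id;
	uint32_t mm_cid;
};

enum exit_path { EXIT_FAST_PATH, EXIT_SLOW_PATH };

/* A store to the user-space rseq area is only needed if the values changed. */
static bool needs_user_store(const struct cached_ids *cached,
			     const struct cached_ids *now)
{
	assert(cached && now);
	return cached->cpu_id != now->cpu_id || cached->mm_cid != now->mm_cid;
}

/*
 * Assumption: when a store is needed right after fork it may fault, so the
 * slow path is chosen directly; otherwise the fast path is sufficient.
 */
static enum exit_path pick_path_after_fork(const struct cached_ids *inherited,
					   const struct cached_ids *now)
{
	return needs_user_store(inherited, now) ? EXIT_SLOW_PATH : EXIT_FAST_PATH;
}
```

The sketch leaves out the part that makes this non-trivial in practice:
the scheduler would have to know that this is the first schedule after
fork in order to set the right flag.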

> 
> Anyway, I just took the numbers for a make -j128 build on a 64-CPU
> system again with the force removed. I have to correct my claim about
> vast majority for that case, but 69% of forked children which fault in the
> TLS/RSEQ page before exit or exec is definitely not a small number.
> 
> The pattern, which brought me to actually force the slowpath
> unconditionally was:
> 
>    child()
>    ...
>    return to user:
> 
>      fast_path -> faults
>      slow_path -> faults and handles the fault
>      fast_path    (usually a NOOP, but not necessarily)
>      done
> 
> Though I'm not insisting on keeping the fork force slowpath mechanics.

I'm OK with forcing the slowpath on return to user from fork, but the
comments justifying why we do this should be worded in more nuanced
terms with regard to the expected ratio of faults on fork and its
variability across workloads.

That opens the door to further improvements in the future based on
additional metrics.
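
As a rough aid for wording those comments, the pattern quoted above can
be modelled as a toy pass counter. This is a sketch, not kernel code:
exit_passes() is a hypothetical helper, and treating the forced slow
path as a single pass is an assumption of the model, not something
measured.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: count passes through the exit-to-user work loop for a
 * freshly forked child. page_present says whether the TLS/RSEQ page is
 * already mapped when the child first returns to user space.
 */
static int exit_passes(bool force_slowpath, bool page_present)
{
	int passes = 0;

	if (!force_slowpath) {
		passes++;		/* fast path attempt */
		if (page_present)
			return passes;	/* store succeeded, done */
		/* the store faulted: another round through the work loop */
	}

	passes++;			/* slow path: faults and handles the fault */

	if (!force_slowpath)
		passes++;		/* fast path again, usually a no-op */

	return passes;
}
```

In this model a child whose page is not yet mapped takes three passes
without the force and one with it, while a child whose page is already
mapped takes one pass either way, although the slow path pass is the
more expensive one. Which case dominates depends on the workload.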

> 
> As we might eventually agree on, it depends on the workload and the
> state of the system, so the mileage may vary pretty much.

Yes. I'm OK with just altering the comments and commit message
to a wording that is less absolute about the optimization choices
made here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com