Message-ID: <87frcrrbrx.ffs@tglx>
Date: Fri, 12 Sep 2025 16:22:26 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, LKML
 <linux-kernel@...r.kernel.org>
Cc: Michael Jeanson <mjeanson@...icios.com>, Jens Axboe <axboe@...nel.dk>,
 Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
 <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>, Paolo Bonzini
 <pbonzini@...hat.com>, Sean Christopherson <seanjc@...gle.com>, Wei Liu
 <wei.liu@...nel.org>, Dexuan Cui <decui@...rosoft.com>, x86@...nel.org,
 Arnd Bergmann <arnd@...db.de>, Heiko Carstens <hca@...ux.ibm.com>,
 Christian Borntraeger <borntraeger@...ux.ibm.com>, Sven Schnelle
 <svens@...ux.ibm.com>, Huacai Chen <chenhuacai@...nel.org>, Paul Walmsley
 <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>
Subject: Re: [patch V4 28/36] rseq: Switch to fast path processing on exit
 to user

On Thu, Sep 11 2025 at 16:00, Mathieu Desnoyers wrote:
> On 2025-09-11 12:47, Thomas Gleixner wrote:
>> The normal case is that the fault is pretty much guaranteed to happen
>> because the scheduler places the child on a different CPU and therefore
>> the CPU/MM IDs need to be updated anyway.
>
> Is this "normal case" the behavior you wish the scheduler to have or
> is it based on metrics?

I did measurements, and that's how I came to that conclusion: I measured
forks vs. faults in the page which holds the RSEQ memory, i.e. the TLS area.

> Here is the result of a sched_fork() instrumentation vs the first
> pick_next_task that follows for each thread (accounted with schedstats
> see diff at the end of this email, followed by my kernel config), and we
> can observe the following pattern with a parallel kernel build workload
> (64-core VM, on a physical machine that has 384 logical cpus):
>
> - When undercommitted, in which case the scheduler has idle core
>    targets to select (make -j32 on a 64-core VM), indeed the behavior
>    favors spreading the workload to different CPUs (in which case we
>    would hit a page fault on return from fork, as you predicted):
>    543 same-cpu vs 62k different-cpu.
>    Most of the same-cpu are on CPU 0 (402), surprisingly.
>
> - When using all the cores (make -j64 on a 64-core VM) (and
>    also for overcommit scenarios tested up to -j400), the
>    picture differs:
>    27k same-cpu vs 32k different-cpu, which is pretty much even.

Sure. In overcommit there is obviously a higher probability of staying
on the same CPU.

> This is a kernel build workload, which consists mostly of
> single-threaded processes (make, gcc). I therefore expect the
> rseq mm_cid to be always 0 for those cases.

Sure, but that does not mean the CPU stays the same.

>> The only cases where this is not true are when there is no capacity to
>> do so, on UP, or when the parent was affine to a single CPU, which is
>> what the child inherits.
>
> Wrong. See above.

Huch? You just confirmed what I was saying. It's not true for
overcommit, i.e. there is no capacity to place it on a different CPU
right away.

> Moreover, a common userspace process execution pattern is fork followed
> by exec, in which case whatever page faults happen on fork become
> irrelevant as soon as exec replaces the entire mapping.

That's correct, but for exec to happen the child first needs to
complete the fork and go out to user space to do the exec, without
touching TLS and without being placed on a different CPU, right?

So again, that depends on where the scheduler ends up placing the child.
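
For reference, that pattern in user space is nothing more than the usual
spawn-style helper (hypothetical code for illustration, not from the
patch set):

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, illustration only: the common fork-followed-by-exec
 * pattern discussed above. The child runs (almost) no user space code
 * between fork() and execve() before the mapping is replaced. */
static pid_t spawn(const char *path, char *const argv[], char *const envp[])
{
        pid_t pid = fork();

        if (pid == 0) {
                execve(path, argv, envp);
                _exit(127);             /* exec failed */
        }
        return pid;
}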

>> the fault will happen in the fast path first, which means the exit code
>> has to go another round through the work loop instead of forcing the
>> fault right away on the first exit in the slowpath, where it can be
>> actually resolved.
>
> I think you misunderstood my improvement suggestion: I do _not_
> recommend that we pointlessly take the fast-path on fork, but rather
> that we keep the cached cpu_cid value (from the parent) around to check
> whether we actually need to store to userspace and take a page fault.
> Based on this, we can choose whether we use the fast-path or go directly
> to the slow-path.

That's not that trivial, because you need an extra decision in the
scheduler to take the slowpath, based on whether that's the first
schedule (and CPU change) after fork, so the scheduler can set either
TIF_RSEQ or NOTIFY along with the slowpath flag.

But that's just not worth it. So forcing the slowpath after fork seemed
to be a sensible thing to do.
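
For illustration, a rough model of what the comparison based variant
would boil down to (made-up names, not the actual patch; it assumes the
ids last written to user space are carried along across fork):

#include <stdbool.h>

/* Illustrative model only, not kernel code; struct and helper names are
 * made up. The idea: keep the ids which were last written to the user
 * space rseq area (inherited from the parent across fork) and compare
 * them against the current values before touching user space at all. */
struct rseq_ids {
        unsigned int cpu_id;
        unsigned int mm_cid;
};

static bool rseq_ids_unchanged(const struct rseq_ids *cached,
                               unsigned int cur_cpu, unsigned int cur_mm_cid)
{
        return cached->cpu_id == cur_cpu && cached->mm_cid == cur_mm_cid;
}

/*
 * Exit to user space:
 *   ids unchanged -> nothing to store, no fault possible
 *   ids changed   -> go straight to the slowpath, which can handle the fault
 */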

Anyway, I just took the numbers for a make -j128 build on a 64-CPU
system again with the force removed. I have to correct my claim about
the vast majority for that case, but 69% of forked children faulting in
the TLS/RSEQ page before exit or exec is definitely not a small number.

The pattern which brought me to actually force the slowpath
unconditionally was:

  child()
  ...
  return to user:

    fast_path -> faults
    slow_path -> faults and handles the fault
    fast_path    (usually a NOOP, but not necessarily)
    done
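
In other words, without the force the first exit bounces off the fast
path before the slowpath can resolve the fault. A hand-wavy sketch of
that loop (illustrative names, not the actual implementation):

#include <stdbool.h>

/* Hand-wavy model of the exit-to-user work loop, not the actual code.
 * Assumed helpers with made-up names:
 *   rseq_fastpath_update() - returns false when the user space store faults
 *   rseq_slowpath_update() - takes the fault and resolves it
 */
extern bool rseq_fastpath_update(void);
extern void rseq_slowpath_update(void);

static void exit_to_user_rseq(bool slowpath_forced)
{
        for (;;) {
                if (!slowpath_forced && rseq_fastpath_update())
                        break;                  /* store succeeded, done */
                rseq_slowpath_update();         /* handles the fault */
                slowpath_forced = false;        /* next round uses the fast path */
        }
}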

Though I'm not insisting on keeping the fork-forced slowpath mechanics.

As we might eventually agree, it depends on the workload and the state
of the system, so the mileage may vary quite a bit.

Thanks,

        tglx
