Message-ID: <5c1e0366-1016-4cc0-9b86-888e8c5c2a31@efficios.com>
Date: Thu, 11 Sep 2025 10:45:43 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Michael Jeanson <mjeanson@...icios.com>, Jens Axboe <axboe@...nel.dk>,
 Peter Zijlstra <peterz@...radead.org>, "Paul E. McKenney"
 <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
 Paolo Bonzini <pbonzini@...hat.com>, Sean Christopherson
 <seanjc@...gle.com>, Wei Liu <wei.liu@...nel.org>,
 Dexuan Cui <decui@...rosoft.com>, x86@...nel.org,
 Arnd Bergmann <arnd@...db.de>, Heiko Carstens <hca@...ux.ibm.com>,
 Christian Borntraeger <borntraeger@...ux.ibm.com>,
 Sven Schnelle <svens@...ux.ibm.com>, Huacai Chen <chenhuacai@...nel.org>,
 Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>
Subject: Re: [patch V4 28/36] rseq: Switch to fast path processing on exit to
 user

On 2025-09-11 10:44, Mathieu Desnoyers wrote:
> On 2025-09-08 17:32, Thomas Gleixner wrote:
>> Now that all bits and pieces are in place, hook the RSEQ handling fast path
>> function into exit_to_user_mode_prepare() after the TIF work bits have been
>> handled. In case of fast path failure, TIF_NOTIFY_RESUME has been raised
>> and the caller needs to take another turn through the TIF handling slow
>> path.
>>
>> This only works for architectures, which use the generic entry code.
> 
> Remove comma after "architectures"
> 
>> Architectures, who still have their own incomplete hacks are not supported
> 
> Remove comma after "Architectures"
> 
>> and won't be.
>>
>> This results in the following improvements:
>>
>>    Kernel build         Before               After            Reduction
>>
>>    exit to user:      80692981            80514451
>>    signal checks:        32581                 121                  99%
>>    slowpath runs:      1201408  1.49%          198  0.00%          100%
>>    fastpath runs:                           675941  0.84%           N/A
>>    id updates:         1233989  1.53%        50541  0.06%           96%
>>    cs checks:          1125366  1.39%            0  0.00%          100%
>>      cs cleared:       1125366   100%            0                 100%
>>      cs fixup:               0     0%            0
>>
>>    RSEQ selftests       Before               After            Reduction
>>
>>    exit to user:     386281778           387373750
>>    signal checks:     35661203                   0                 100%
>>    slowpath runs:    140542396 36.38%           100  0.00%        100%
>>    fastpath runs:                           9509789  2.51%         N/A
>>    id updates:       176203599 45.62%       9087994  2.35%         95%
>>    cs checks:        175587856 45.46%       4728394  1.22%         98%
>>      cs cleared:     172359544 98.16%       1319307 27.90%         99%
>>      cs fixup:         3228312  1.84%       3409087 72.10%
>>
>> The 'cs cleared' and 'cs fixup' percentanges are not relative to the exit
> 
> percentages
> 
>> to user invocations, they are relative to the actual 'cs check'
>> invocations.
>>
>> While some of this could have been avoided in the original code, like the
>> obvious clearing of CS when it's already clear, the main problem of going
>> through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
>> notify handler is invoked more than once before going out to user
>> space. Doing this once when everything has stabilized is the only solution
>> to avoid this.
>>
>> The initial attempt to completely decouple it from the TIF work turned out
>> to be suboptimal for workloads, which do a lot of quick and short system
>> calls. Even if the fast path decision is only 4 instructions (including a
>> conditional branch), this adds up quickly and becomes measurable when the
>> rate for actually having to handle rseq is in the low single digit
>> percentage range of user/kernel transitions.
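
Side note for readers following the numbers: the "4 instructions" above
are the cheap gate on the recorded event state, taken before any actual
rseq processing. Conceptually, a sketch rather than the literal code
from this series:

	/* Nothing to do for rseq unless a schedule event was recorded. */
	if (likely(!current->rseq.event.sched_switch))
		return;

Even such a small constant cost becomes measurable when it runs on
every user/kernel transition, as the changelog notes.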
>>
>> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
>> ---
>> V4: Move the rseq handling into a separate loop to avoid gotos later on
>> ---
>>   include/linux/irq-entry-common.h |    7 ++-----
>>   include/linux/resume_user_mode.h |    2 +-
>>   include/linux/rseq.h             |   23 +++++++++++++++++------
>>   init/Kconfig                     |    2 +-
>>   kernel/entry/common.c            |   26 +++++++++++++++++++-------
>>   kernel/rseq.c                    |    8 ++++++--
>>   6 files changed, 46 insertions(+), 22 deletions(-)
>>
>> --- a/include/linux/irq-entry-common.h
>> +++ b/include/linux/irq-entry-common.h
>> @@ -197,11 +197,8 @@ static __always_inline void arch_exit_to
>>    */
>>   void arch_do_signal_or_restart(struct pt_regs *regs);
>> -/**
>> - * exit_to_user_mode_loop - do any pending work before leaving to user space
>> - */
>> -unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> -                     unsigned long ti_work);
>> +/* Handle pending TIF work */
>> +unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
>>   /**
>>    * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
>> --- a/include/linux/resume_user_mode.h
>> +++ b/include/linux/resume_user_mode.h
>> @@ -59,7 +59,7 @@ static inline void resume_user_mode_work
>>       mem_cgroup_handle_over_high(GFP_KERNEL);
>>       blkcg_maybe_throttle_current();
>> -    rseq_handle_notify_resume(regs);
>> +    rseq_handle_slowpath(regs);
>>   }
>>   #endif /* LINUX_RESUME_USER_MODE_H */
>> --- a/include/linux/rseq.h
>> +++ b/include/linux/rseq.h
>> @@ -5,13 +5,19 @@
>>   #ifdef CONFIG_RSEQ
>>   #include <linux/sched.h>
>> -void __rseq_handle_notify_resume(struct pt_regs *regs);
>> +void __rseq_handle_slowpath(struct pt_regs *regs);
>> -static inline void rseq_handle_notify_resume(struct pt_regs *regs)
>> +/* Invoked from resume_user_mode_work() */
>> +static inline void rseq_handle_slowpath(struct pt_regs *regs)
>>   {
>> -    /* '&' is intentional to spare one conditional branch */
>> -    if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
>> -        __rseq_handle_notify_resume(regs);
>> +    if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
>> +        if (current->rseq.event.slowpath)
>> +            __rseq_handle_slowpath(regs);
>> +    } else {
>> +        /* '&' is intentional to spare one conditional branch */
>> +        if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
> 
> See my earlier comment: the has_rseq check here is perhaps redundant.
> 
>> +            __rseq_handle_slowpath(regs);
>> +    }
>>   }
>>   void __rseq_signal_deliver(int sig, struct pt_regs *regs);
>> @@ -142,11 +148,16 @@ static inline void rseq_fork(struct task
>>       } else {
>>           t->rseq = current->rseq;
>>           t->rseq.ids.cpu_cid = ~0ULL;
> 
> As discussed earlier, do we really want to clear cpu_cid here, or
> copy it from the parent? If we keep the parent's cached values, I
> suspect we can skip the page fault on return from fork in many cases.
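
In other words, something like this (sketch of what I mean, i.e. simply
dropping the invalidation and keeping the copy done just above):

	} else {
		t->rseq = current->rseq;
		/* Keep the parent's cached cpu_cid rather than ~0ULL. */
	}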
> 
>> +        /*
>> +         * If it has rseq, force it into the slow path right away
>> +         * because it is guaranteed to fault.
>> +         */
>> +        t->rseq.event.slowpath = t->rseq.event.has_rseq;
> 
> I think we can do better here. It's only guaranteed to fault if:
>
> - has_rseq is set, AND
>   - the cpu or cid has changed compared to the cached value, OR
>   - the rseq_cs user pointer is non-NULL.
>
> Otherwise we should be able to handle the return from fork from the fast
> path just with loads from the rseq area, or am I missing something?
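
To make that concrete, something along these lines (untested sketch;
rseq_ids_match() and rseq_cs_is_clear() are made-up helper names for
whatever checks are practical at fork time, and the rseq_cs check would
need either user access or a kernel-side cached value):

	/*
	 * Only force the slow path when a fault is actually guaranteed:
	 * the ids differ from the cached values, or a non-NULL rseq_cs
	 * would have to be cleared.
	 */
	t->rseq.event.slowpath = t->rseq.event.has_rseq &&
				 (!rseq_ids_match(t) || !rseq_cs_is_clear(t));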
> 
> Thanks,
> 
> Mathieu
> 

Just making sure you don't miss one additional comment below...

>>       }
>>   }
>>   #else /* CONFIG_RSEQ */
>> -static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
>> +static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
>>   static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
>>   static inline void rseq_sched_switch_event(struct task_struct *t) { }
>>   static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1911,7 +1911,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>>   config DEBUG_RSEQ
>>       default n
>>       bool "Enable debugging of rseq() system call" if EXPERT
>> -    depends on RSEQ && DEBUG_KERNEL
>> +    depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
> 
> I'm confused about this hunk. Perhaps this belongs to a different
> commit?

^ here.

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
>>       select RSEQ_DEBUG_DEFAULT_ENABLE
>>       help
>>         Enable extra debugging checks for the rseq system call.
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -11,13 +11,8 @@
>>   /* Workaround to allow gradual conversion of architecture code */
>>   void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
>> -/**
>> - * exit_to_user_mode_loop - do any pending work before leaving to user space
>> - * @regs:    Pointer to pt_regs on entry stack
>> - * @ti_work:    TIF work flags as read by the caller
>> - */
>> -__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> -                             unsigned long ti_work)
>> +static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
>> +                                  unsigned long ti_work)
>>   {
>>       /*
>>        * Before returning to user space ensure that all pending work
>> @@ -62,6 +57,23 @@ void __weak arch_do_signal_or_restart(st
>>       return ti_work;
>>   }
>> +/**
>> + * exit_to_user_mode_loop - do any pending work before leaving to user space
>> + * @regs:    Pointer to pt_regs on entry stack
>> + * @ti_work:    TIF work flags as read by the caller
>> + */
>> +__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> +                             unsigned long ti_work)
>> +{
>> +    for (;;) {
>> +        ti_work = __exit_to_user_mode_loop(regs, ti_work);
>> +
>> +        if (likely(!rseq_exit_to_user_mode_restart(regs)))
>> +            return ti_work;
>> +        ti_work = read_thread_flags();
>> +    }
>> +}
>> +
>>   noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>>   {
>>       irqentry_state_t ret = {
>> --- a/kernel/rseq.c
>> +++ b/kernel/rseq.c
>> @@ -234,7 +234,11 @@ static bool rseq_handle_cs(struct task_s
>>   static void rseq_slowpath_update_usr(struct pt_regs *regs)
>>   {
>> -    /* Preserve rseq state and user_irq state for exit to user */
>> +    /*
>> +     * Preserve rseq state and user_irq state. The generic entry code
>> +     * clears user_irq on the way out, the non-generic entry
>> +     * architectures are not having user_irq.
>> +     */
>>       const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
>>       struct task_struct *t = current;
>>       struct rseq_ids ids;
>> @@ -286,7 +290,7 @@ static void rseq_slowpath_update_usr(str
>>       }
>>   }
>> -void __rseq_handle_notify_resume(struct pt_regs *regs)
>> +void __rseq_handle_slowpath(struct pt_regs *regs)
>>   {
>>       /*
>>        * If invoked from hypervisors before entering the guest via
>>
> 
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
