linux-kernel - Re: [PATCH V7 01/11] sched: Scheduler time slice extension

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4D9C86E6-C4AC-42A6-AB51-5D90FD4FD95D@oracle.com>
Date: Thu, 7 Aug 2025 16:13:37 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "rostedt@...dmis.org"
	<rostedt@...dmis.org>,
        "mathieu.desnoyers@...icios.com"
	<mathieu.desnoyers@...icios.com>,
        "bigeasy@...utronix.de"
	<bigeasy@...utronix.de>,
        "kprateek.nayak@....com" <kprateek.nayak@....com>,
        "vineethr@...ux.ibm.com" <vineethr@...ux.ibm.com>
Subject: Re: [PATCH V7 01/11] sched: Scheduler time slice extension



> On Aug 6, 2025, at 1:34 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
> 
> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
>> @@ -304,7 +304,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
>>  * exit_to_user_mode_loop - do any pending work before leaving to user space
>>  */
>> unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> -     unsigned long ti_work);
>> +     unsigned long ti_work, bool irq);
> 
> I know the kernel-doc already lacks the description for the existing
> arguments, but adding more undocumented ones is not the right thing
> either.
> 
> Also please name this argument 'from_irq' to make it clear what this is
> about.

Ok, will change it.

> 
>> /**
>>  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
>> @@ -316,7 +316,7 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>>  *    EXIT_TO_USER_MODE_WORK are set
>>  * 4) check that interrupts are still disabled
>>  */
>> -static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
>> +static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs, bool irq)
> 
> New argument not documented in kernel-doc.

Will add necessary documentation.

> 
>> {
>> unsigned long ti_work;
>> 
>> @@ -327,7 +327,10 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
>> 
>> ti_work = read_thread_flags();
>> if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>> - ti_work = exit_to_user_mode_loop(regs, ti_work);
>> + ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
>> +
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) && irq)
>> + rseq_delay_resched_arm_timer();
> 
> This is still an unconditional function call which is a NOOP for
> everyone who does not use this. It's not that hard to inline the
> check. How often do I have to explain that?

Will fix.

> 
>> arch_exit_to_user_mode_prepare(regs, ti_work);
>> 
>> @@ -396,6 +399,9 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>> 
>> CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
>> 
>> + /* Reschedule if scheduler time delay was granted */
> 
> This is not rescheduling. It sets NEED_RESCHED, which is a completely
> different thing.
> 
>> + rseq_delay_set_need_resched();
> 
> I fundamentally hate this hack as it goes out to user space with
> NEED_RESCHED set and absolutely zero debug mechanism which validates
> it. Currently going out with NEED_RESCHED set is a plain bug, rigthfully
> so.
> 
> But now this muck comes along and sets the flag, which is semantically
> just wrong and ill defined.
> 
> The point is that NEED_RESCHED has been cleared by requesting and
> granting the extension, which means the task can go out to userspace,
> until it either relinquishes the CPU or hrtick() whacks it over the
> head.
> 
> And your muck requires this insane hack with sched_yield():
> 
>> SYSCALL_DEFINE0(sched_yield)
>> {
>> + /* Reschedule if scheduler time delay was granted */
>> + if (rseq_delay_set_need_resched())
>> + return 0;
>> +
>> do_sched_yield();
>> return 0;
>> }
> 
> That's just completely wrong. Relinquishing the CPU should be possible
> by any arbitrary syscall and not require to make sched_yield() more
> ill-defined as it is already.
> 
> The obvious way to solve both issues is to clear NEED_RESCHED when
> the delay is granted and then do in syscall_enter_from_user_mode_work()
> 
>        rseq_delay_sys_enter()
>        {
>             if (unlikely(current->rseq_delay_resched == GRANTED)) {
>    set_tsk_need_resched(current);
>                    schedule();
>             }       
>        }      
> 
> No?
> 
> It's debatable whether the schedule() there is necessary. Removing it
> would allow the task to either complete the syscall and reschedule on
> exit to user space or go to sleep in the syscall. But that's a trivial
> detail.
> 
> The important point is that the NEED_RESCHED semantics stay sane and the
> problem is solved right on the next syscall entry.
> 
> This delay is not for extending CPU time accross syscalls, it's solely
> to allow user space to complete a _user space_ critical
> section. Everything else is just wrong and we don't implement it as an
> invitation for abuse.
> 
> For the record: I used GRANTED on purpose, because REQUESTED is
> bogus. At the point where __rseq_delay_resched() is invoked _AND_
> observes the user space request, it grants the delay, no?
> 
> This random nomenclature is just making this stuff annoyingly hard to
> follow.
> 

Ok I can move the check to relinquish cpu in syscall_enter_from_user_mode_work()
instead of in syscall_exit_to_user_mode_work(). 


>> +static inline bool rseq_delay_resched(unsigned long ti_work)
>> +{
>> + if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> + return false;
>> +
>> + if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))
>> + return false;
> 
> Why unlikely? The majority of applications do not use this.

WIll change to likely().

> 
>> +
>> + if (!(ti_work & (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY)))
>> + return false;
> 
> The caller already established that one of these flags is set, no?

That is right, will delete this check here. 

> 
>> + if (__rseq_delay_resched()) {
>> + clear_tsk_need_resched(current);
> 
> Why has this to be inline and is not done in __rseq_delay_resched()?

Sure, it could be in __rseq_delay_resched(). 

> 
>> + return true;
>> + }
>> + return false;
> 
>> /**
>>  * exit_to_user_mode_loop - do any pending work before leaving to user space
>>  * @regs: Pointer to pt_regs on entry stack
>>  * @ti_work: TIF work flags as read by the caller
>>  */
>> __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> -     unsigned long ti_work)
>> +     unsigned long ti_work, bool irq)
>> {
> 
> Same comments as above.
> 
>> +
>> +void rseq_delay_resched_tick(void)
>> +{
>> + if (current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
>> + set_tsk_need_resched(current);
> 
> Small enough to inline into hrtick() with a IS_ENABLED() guard, no?

 I can move it to hrtick() and delete the rseq_delay_resched_tick() routine.

> 
>> +}
>> +#endif /* CONFIG_RSEQ_RESCHED_DELAY */
>> +
>> #ifdef CONFIG_DEBUG_RSEQ
>> 
>> /*
>> @@ -493,6 +527,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
>> current->rseq = NULL;
>> current->rseq_sig = 0;
>> current->rseq_len = 0;
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> + current->rseq_delay_resched = RSEQ_RESCHED_DELAY_NONE;
> 
> What's that conditional for?
> 
> t->rseq_delay_resched is unconditionally available. Your choice of
> optimizing the irrelevant places is amazing.

Will fix.

> 
>> return 0;
>> }
>> 
>> @@ -561,6 +597,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
>> current->rseq = rseq;
>> current->rseq_len = rseq_len;
>> current->rseq_sig = sig;
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> + current->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
> 
> Why is this done unconditionally for rseq?
> 
> So that any rseq user needs to do a function call and a copy_from_user()
> just for nothing?
> 
> A task, which needs this muck, can very well opt-in for this and leave
> everybody else unaffected, no?

Sure, that seems reasonable.

> 
> prctl() exists for a reason and that allows even filtering out the
> request to enable it if the sysadmin sets up filters accordingly.
> 
> As code which wants to utilize this has to be modified anyway, adding
> the prctl() is not a unreasonable requirement.
> 
>> clear_preempt_need_resched();
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
>> +    prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
>> + prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
> 
> Yet another code conditional for no reason. These are two bits and you can
> use them smart:
> 
> #define ENABLED 1
> #define GRANTED 3
> 
> So you can just go and do
> 
>   prev->rseq_delay_resched &= RSEQ_RESCHED_DELAY_ENABLED;
> 
> which clears the GRANTED bit without a conditional and that's correct
> whether the ENABLED bit was set or not.
> 
> In the syscall exit path you then do:
> 
> static inline bool rseq_delay_resched(void)
> {
>   if (prev->rseq_delay_resched != ENABLED)
>    return false;
>   return __rseq_delay_resched();
> }
> 
> and __rseq_delay_resched() does:
> 
>    rseq_delay_resched = GRANTED;
> 
> No?

That is a nice. I can add a prctl() call to enable & disable this functionality
which would help avoid unnecessary copy_from_user() calls.

In addition to registering the ‘rseq’ struct the application that needs the functionality
will have to make  the prctl() call to enable it, which I think should be reasonable.  

Thanks,
-Prakash.


> 
> Thanks,
> 
>        tglx