Message-ID: <A61407ED-56DA-4462-9592-0BAF6FE9E1DB@oracle.com>
Date: Fri, 25 Apr 2025 18:55:17 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"peterz@...radead.org" <peterz@...radead.org>,
"rostedt@...dmis.org"
<rostedt@...dmis.org>,
"mathieu.desnoyers@...icios.com"
<mathieu.desnoyers@...icios.com>,
"tglx@...utronix.de" <tglx@...utronix.de>
Subject: Re: [PATCH V2 1/3] Sched: Scheduler time slice extension
> On Apr 24, 2025, at 11:53 PM, Sebastian Andrzej Siewior <bigeasy@...utronix.de> wrote:
>
> On 2025-04-25 01:19:07 [+0000], Prakash Sangappa wrote:
>>> On Apr 24, 2025, at 7:13 AM, Sebastian Andrzej Siewior <bigeasy@...utronix.de> wrote:
>>>
>>> On 2025-04-18 19:34:08 [+0000], Prakash Sangappa wrote:
>>>> --- a/include/linux/sched.h
>>>> +++ b/include/linux/sched.h
>>> …
>>>> @@ -930,6 +931,9 @@ struct task_struct {
>>>> struct plist_node pushable_tasks;
>>>> struct rb_node pushable_dl_tasks;
>>>> #endif
>>>> +#ifdef CONFIG_RSEQ
>>>> + unsigned rseq_sched_delay:1;
>>>> +#endif
>>>
>>> There should be somewhere a bitfield already which you could use without
>>> the ifdef. Then you could use IS_ENABLED() if you want to save some code
>>> if RSEQ is not enabled.
>>
>> I suppose we could.
>> Patch 1 is pretty much what PeterZ posted, hope he will comment on it.
>
> If it is "pretty much what PeterZ posted" why did he not receive any
> credit for it?
Sure, he gets credit.
I have included a 'Suggested-by' tag in the patch. Will change it to 'Signed-off-by' if that is what it should be.
>
>> Could it be moved below here, call it sched_time_delay, or some variant of this name?
>
> I don't mind the name. The point is to add to an existing group instead
> of starting a new u32 bit field.
Ok, I think this place, after 'in_thrashing', would serve that purpose.
>
>> struct task_struct {
>> ..
>> #ifdef CONFIG_TASK_DELAY_ACCT
>> /* delay due to memory thrashing */
>> unsigned in_thrashing:1;
>> #endif
>> unsigned sched_time_delay:1;
>> ..
>> }
>>
>> This field will be for scheduler time extension use only. Mainly updated in the context of the current thread.
>> Do we even need to use IS_ENABLED(CONFIG_RSEQ) to access?
>
> Well, if you want to avoid the code in the !CONFIG_RSEQ then yes.
Sure, I can include IS_ENABLED() check when accessing.
>
>>>> struct mm_struct *mm;
>>>> struct mm_struct *active_mm;
>>>> --- a/include/uapi/linux/rseq.h
>>>> +++ b/include/uapi/linux/rseq.h
>>> …
>>>> @@ -128,6 +131,8 @@ struct rseq {
>>>> * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>>>> * Inhibit instruction sequence block restart on migration for
>>>> * this thread.
>>>> + * - RSEQ_CS_DELAY_RESCHED
>>>> + * Try delay resched...
>>>
>>> Delay resched up to $time for $kind-of-stats under $conditions.
>>
>> Will add some comment like
>> "Delay resched for up to 50us. Checked when the thread is about to be preempted."
>>
>> With the tunable added in the subsequent patch, will change '50us' to the tunable name.
>
> A proper description of the flag would be nice. The current state is that
> I can derive more from the constant than from the comment.
Will modify.
>
>>>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>>>> index 6b7ff1bc1b9b..944027d14198 100644
>>>> --- a/kernel/entry/common.c
>>>> +++ b/kernel/entry/common.c
>>> …
>>>> @@ -99,8 +100,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>>>>
>>>> local_irq_enable_exit_to_user(ti_work);
>>>>
>>>> - if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>>>> - schedule();
>>>> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
>>>
>>> couldn't we restrict this to _TIF_NEED_RESCHED_LAZY? That way we would
>>> still schedule immediately for any SCHED_FIFO/RR/DL tasks and do this
>>> delay only for everything else such as SCHED_OTHER/…
>>
>>
>> Wasn't this the entire discussion about whether to limit it to SCHED_OTHER or not?
>> Will defer it to Peter.
>
> Oh. But this still deserves a trace point for this manoeuvre. A trace
> would show you a wakeup, the need-resched bit will be shown and then it
> will vanish later and people might wonder where did it go.
I can look at adding a tracepoint here to indicate that a delay was granted and the need-resched bit was cleared.
>
>>>
>>>> + if (irq && rseq_delay_resched())
>>>> + clear_tsk_need_resched(current);
>>>> + else
>>>> + schedule();
>>>> + }
>>>>
>>>> if (ti_work & _TIF_UPROBE)
>>>> uprobe_notify_resume(regs);
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 165c90ba64ea..cee50e139723 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -823,6 +823,7 @@ void update_rq_clock(struct rq *rq)
>>>>
>>>> static void hrtick_clear(struct rq *rq)
>>>> {
>>>> + rseq_delay_resched_tick();
>>>
>>> This is called from __schedule(). If you set the need-resched flag here,
>>> it gets removed shortly after. Do I miss something?
>>
>> hrtick_clear() is also called when the cpu is being removed in sched_cpu_dying().
>> We need to set resched there?
>
> Do we? My understanding is that the NEED_RESCHED flag gets removed once
> and then RSEQ_CS_DELAY_RESCHED gets set. RSEQ_CS_DELAY_RESCHED in turn
> gets cleared in the scheduler once task leaves the CPU. Once the task
> left the CPU then there is no need to set the bit. The sched_cpu_dying()
> is the HP thread so if that one is on the CPU then the user task is
> gone.
Ok, will remove this call from hrtick_clear().
To clarify the RSEQ_CS_DELAY_RESCHED usage: the user task sets the bit in its rseq structure to request additional time when entering a critical section in user space.
When the kernel sees the bit set in exit_to_user_mode_loop(), as the task is about to be rescheduled, it clears RSEQ_CS_DELAY_RESCHED in the 'rseq' structure, sets t->rseq_sched_delay = 1 in task_struct, and starts the 50us hrtick timer.
When the timer fires, if the task is still running (i.e. t->rseq_sched_delay is set), the task is rescheduled. In __schedule(), when the task leaves the cpu, t->rseq_sched_delay is cleared.
>
> How does this delay thingy work with HZ=100 vs HZ=1000? Like what is the
> most you could get in extra time? I could imagine that if a second task
> gets on the runqueue and you skip the wake up but the runtime is used up
> then the HZ tick should set NEED_RESCHED again and the following HZ tick
> should force the schedule point.
The most a task can get is 50us (or the tunable value from patch 2) of extra time. Yes, if the runtime expires within a 50us extension that was granted as part of a wakeup, the task will get rescheduled when the HZ tick sets NEED_RESCHED, which could result in less than 50us of extra time. Why do you say the following HZ tick? In that case the 50us timer would reschedule it.
If the 50us extension was granted at the end of the runtime due to the HZ tick, the task will get rescheduled when the 50us timer fires.
Also, if a wakeup occurs while the running task has the 50us extension (rseq_sched_delay set), it will again get rescheduled by resched_curr() if TIF_NEED_RESCHED is set.
If the task on the cpu has t->rseq_sched_delay set, should we consider avoiding the forced reschedule in the above scenarios (both runtime expiry and wakeup), since the task would be rescheduled by the 50us timer anyway?
Prakash
>
>> Thanks for your comments.
>> -Prakash.
>
> Sebastian