linux-kernel - Re: [PATCH V2 1/3] Sched: Scheduler time slice extension

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250425065357.eiwSSvff@linutronix.de>
Date: Fri, 25 Apr 2025 08:53:57 +0200
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Prakash Sangappa <prakash.sangappa@...cle.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"peterz@...radead.org" <peterz@...radead.org>,
	"rostedt@...dmis.org" <rostedt@...dmis.org>,
	"mathieu.desnoyers@...icios.com" <mathieu.desnoyers@...icios.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>
Subject: Re: [PATCH V2 1/3] Sched: Scheduler time slice extension

On 2025-04-25 01:19:07 [+0000], Prakash Sangappa wrote:
> > On Apr 24, 2025, at 7:13 AM, Sebastian Andrzej Siewior <bigeasy@...utronix.de> wrote:
> > 
> > On 2025-04-18 19:34:08 [+0000], Prakash Sangappa wrote:
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> > …
> >> @@ -930,6 +931,9 @@ struct task_struct {
> >> struct plist_node pushable_tasks;
> >> struct rb_node pushable_dl_tasks;
> >> #endif
> >> +#ifdef CONFIG_RSEQ
> >> + unsigned rseq_sched_delay:1;
> >> +#endif
> > 
> > There should be somewhere a bitfield already which you could use without
> > the ifdef. Then you could use IS_ENABLED() if you want to save some code
> > if RSEQ is not enabled.
> 
> I suppose we could. 
> Patch 1 is pretty much what PeterZ posted, hope he will comment on it.

If it is "pretty much what PeterZ posted" why did he not receive any
credit for it?

> Could it be moved below here, call it sched_time_delay, or some variant of this name?

I don't mind the name. The point is to add to an existing group instead
of starting a new u32 bit field.

> struct task_struct {
> ..
> #ifdef CONFIG_TASK_DELAY_ACCT
>         /* delay due to memory thrashing */
>         unsigned                        in_thrashing:1;
> #endif
>         unsigned                        sched_time_delay:1;
> ..
> }
> 
> This field will be for scheduler time extension use only. Mainly updated in the context of the current thread. 
> Do we even need to use IS_ENABLED(CONFIG_RSEQ) to access?

Well, if you want to avoid the code in the !CONFIG_RSEQ then yes.

> >> struct mm_struct *mm;
> >> struct mm_struct *active_mm;
> >> --- a/include/uapi/linux/rseq.h
> >> +++ b/include/uapi/linux/rseq.h
> > …
> >> @@ -128,6 +131,8 @@ struct rseq {
> >> * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> >> *     Inhibit instruction sequence block restart on migration for
> >> *     this thread.
> >> + * - RSEQ_CS_DELAY_RESCHED
> >> + *     Try delay resched...
> > 
> > Delay resched up to $time for $kind-of-stats under $conditions.
> 
> Will add some comment like
>  “Delay resched for upto 50us.  Checked when thread is about to be preempted"
> 
> With the tunable added in the subsequent patch, will change ‘50us' it to the tunable name.

A proper descritption of the flag would nice. The current state is that
I can derive move from the constant than from the comment.

> >> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> >> index 6b7ff1bc1b9b..944027d14198 100644
> >> --- a/kernel/entry/common.c
> >> +++ b/kernel/entry/common.c
> > …
> >> @@ -99,8 +100,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >> 
> >> local_irq_enable_exit_to_user(ti_work);
> >> 
> >> - if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> >> - schedule();
> >> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
> > 
> > couldn't we restrict this to _TIF_NEED_RESCHED_LAZY? That way we would
> > still schedule immediately for any SCHED_FIFO/RR/DL tasks and do this
> > delay only for everything else such as SCHED_OTHER/…
> 
> 
> Wasn’t this the entire discussion about whether to limit it to SCHE_OTHER or not?
> Will defer it to Peter.

Oh. But this still deserves a trace point for this manoeuvre. A trace
would show you a wakeup, the need-resched bit will be shown and then it
will vanish later and people might wonder where did it go.

> > 
> >> +       if (irq && rseq_delay_resched())
> >> +       clear_tsk_need_resched(current);
> >> +       else
> >> +       schedule();
> >> + }
> >> 
> >> if (ti_work & _TIF_UPROBE)
> >> uprobe_notify_resume(regs);
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 165c90ba64ea..cee50e139723 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -823,6 +823,7 @@ void update_rq_clock(struct rq *rq)
> >> 
> >> static void hrtick_clear(struct rq *rq)
> >> {
> >> + rseq_delay_resched_tick();
> > 
> > This is called from __schedule(). If you set the need-resched flag here,
> > it gets removed shortly after. Do I miss something?
> 
> hrtick_clear() is also called when the cpu is being removed in sched_cpu_dying().
> We need to set resched there?

Do we? My understanding is that the NEED_RESCHED flag gets removed once
and then RSEQ_CS_DELAY_RESCHED gets set. RSEQ_CS_DELAY_RESCHED in turn
gets cleared in the scheduler once task leaves the CPU. Once the task
left the CPU then there is no need to set the bit. The sched_cpu_dying()
is the HP thread so if that one is on the CPU then the user task is
gone.

How does this delay thingy work with HZ=100 vs HZ=1000? Like what is the
most you could get in extra time? I could imagine that if a second task
gets on the runqueue and you skip the wake up but the runtime is used up
then the HZ tick should set NEED_RESCHED again and the following HZ tick
should force the schedule point.

> Thanks for your comments.
> -Prakash.

Sebastian