Message-ID: <CAEXW_YThrUgbbmje_1hRtWzNC5SozirDwhpccZiV=Trhe7HiHw@mail.gmail.com>
Date: Mon, 10 Feb 2025 09:07:27 -0500
From: Joel Fernandes <joel@...lfernandes.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Prakash Sangappa <prakash.sangappa@...cle.com>, Peter Zijlstra <peterz@...radead.org>, 
	linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org, 
	Thomas Gleixner <tglx@...utronix.de>, Ankur Arora <ankur.a.arora@...cle.com>, 
	Linus Torvalds <torvalds@...ux-foundation.org>, linux-mm@...ck.org, x86@...nel.org, 
	Andrew Morton <akpm@...ux-foundation.org>, luto@...nel.org, bp@...en8.de, 
	dave.hansen@...ux.intel.com, hpa@...or.com, juri.lelli@...hat.com, 
	vincent.guittot@...aro.org, willy@...radead.org, mgorman@...e.de, 
	jon.grimm@....com, bharata@....com, raghavendra.kt@....com, 
	Boris Ostrovsky <boris.ostrovsky@...cle.com>, Konrad Wilk <konrad.wilk@...cle.com>, jgross@...e.com, 
	Andrew.Cooper3@...rix.com, Vineeth Pillai <vineethrp@...gle.com>, 
	Suleiman Souhlal <suleiman@...gle.com>, Ingo Molnar <mingo@...nel.org>, 
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Clark Williams <clark.williams@...il.com>, 
	bigeasy@...utronix.de, daniel.wagner@...e.com, 
	Joseph Salisbury <joseph.salisbury@...cle.com>, broonie@...il.com
Subject: Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice

On Thu, Feb 6, 2025 at 8:30 AM Steven Rostedt <rostedt@...dmis.org> wrote:
>
> On Wed, 5 Feb 2025 22:07:12 -0500
> Joel Fernandes <joel@...lfernandes.org> wrote:
> > >
> > > RT tasks don't have a time slice. They are affected by events. An external
> > > interrupt coming in, or a timer going off that states something is
> > > happening. Perhaps we could use this for SCHED_RR or maybe even
> > > SCHED_DEADLINE, as those do have time slices.
> > >
> > > But if it does get used, it should only be used when the task being
> > > scheduled is the same SCHED_RR priority, or if SCHED_DEADLINE will not fail
> > > its guarantees.
> > >
> >
> > Right, it would apply still to RR/DL though...
>
> But it would have to guarantee that the RR it is delaying is of the same
> priority, and that delaying the DL is not going to cause something to miss
> its deadline.

See Peter's comment: "Then pick another number; RT too has a max
scheduling latency number (on some random hardware). If you stay below
that, all is fine."
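
That bound could be expressed as a simple admission check. A rough
sketch, with entirely illustrative names and structure (this is not
the kernel's actual API): grant the extension only if a preempting
SCHED_RR task is not of higher priority and the requested delay stays
below the worst-case RT scheduling latency already seen on the box.

```c
#include <stdbool.h>

/* Hypothetical policy check; all names here are illustrative. */
struct rq_snapshot {
	int  curr_rr_prio;      /* RR priority of the running task */
	int  waking_rr_prio;    /* RR priority of the task that woke up */
	long ext_ns;            /* requested extension, in nanoseconds */
	long rt_max_latency_ns; /* measured max RT latency on this box */
};

static bool extension_allowed(const struct rq_snapshot *s)
{
	/* Never delay a higher-priority RR task. */
	if (s->waking_rr_prio > s->curr_rr_prio)
		return false;
	/* Stay below the latency the hardware already imposes on RT. */
	return s->ext_ns < s->rt_max_latency_ns;
}
```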

> > 3. Overloading the purpose of LAZY: My understanding is, the purpose
> > of LAZY is to let the scheduler decide if it wants to preempt based on
> > preemption mode. It is not based on any hint, just on the preemption
> > mode. I guess you are overloading LAZY by making LAZY flag also extend
> > userspace timeslice (versus say making the time-slice extension hint
> > its own thing...).
>
> I already replied about that. Note, LAZY was created in PREEMPT_RT for this
> very purpose (but in the kernel), and ported to vanilla for a slightly
> different purpose.
>
> Here's the history:
>
>   PREEMPT_RT would convert spin_locks in the kernel to sleeping mutexes.
>
>   This made RT tasks respond much faster to events.
>
>   But non-RT (SCHED_OTHER) started suffering performance issues.
>
>   When looking at the performance issues, we found that it was due to tasks
>   holding these sleeping spin_locks and being preempted. That is, the
>   preemption of holding spin_locks was causing more contention and slowing
>   things down tremendously.
>
>   To first handle this, adaptive mutexes were introduced. These would spin
>   if the owner of the lock was still running, and would go to sleep if the
>   owner went to sleep. This helped things quite a bit, but PREEMPT_RT was
>   still suffering a performance deficit compared to non-RT.
>
>   This was because of the timer tick on SCHED_OTHER tasks that could
>   preempt a task holding a spin lock.
>
>   NEED_RESCHED_LAZY was introduced to remedy this. It would be set for
>   SCHED_OTHER tasks and NEED_RESCHED for RT tasks. If the task was holding
>   a sleeping spin lock, the NEED_RESCHED_LAZY would not preempt the running
>   task, but NEED_RESCHED would. If the SCHED_OTHER task was not holding a
>   sleeping spin_lock it would be preempted regardless.
>
> This improved the performance of SCHED_OTHER tasks in PREEMPT_RT to be as
> good as what was in vanilla.
>
> You see, LAZY was *created* for this purpose. Of letting the scheduler know
> that the running task is in a critical section and the timer tick should
> not preempt a SCHED_OTHER task.
> I just wanted to extend this to SCHED_OTHER in user space too.
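
The PREEMPT_RT-era behavior in that history can be modeled in a few
lines. This is a simplified illustration with made-up names, not the
real scheduler code: the tick requests an immediate resched for RT
tasks and only a lazy one for SCHED_OTHER, and the lazy request is
honored only once the task holds no sleeping spin_locks.

```c
#include <stdbool.h>

/* Simplified model; names are illustrative, not kernel API. */
enum resched { RESCHED_NONE, RESCHED_NOW, RESCHED_LAZY };

struct task_model {
	bool rt;             /* SCHED_FIFO/SCHED_RR task */
	int  sleeping_locks; /* sleeping spin_locks currently held */
};

/* On the timer tick: RT gets NEED_RESCHED, SCHED_OTHER gets LAZY. */
static enum resched tick_resched(const struct task_model *t)
{
	return t->rt ? RESCHED_NOW : RESCHED_LAZY;
}

/* A lazy request preempts only outside the critical section;
 * a hard request preempts immediately. */
static bool should_preempt(const struct task_model *t, enum resched r)
{
	if (r == RESCHED_NOW)
		return true;
	if (r == RESCHED_LAZY)
		return t->sleeping_locks == 0;
	return false;
}
```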

Currently, though, it does not "let anyone know" it is running in a
critical section. Various paths (update_curr(), wakeup) just do a
"lazy" resched until the timer tick has elapsed, or until the task
returns to usermode/idle, at which point schedule() is called. And it
does this only for FAIR tasks. That can happen even when the currently
running task is not in a critical section in the kernel at all. Sure,
it may benefit critical sections in the upstream kernel, but where is
that made explicit? I still feel we should not overload this in-kernel
mechanism for userspace locking and complicate things.
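
To model what I mean (illustrative names, not the actual upstream
code): the lazy flag is set for a fair task with no knowledge of any
critical section, and the next tick simply upgrades it to a hard
resched.

```c
#include <stdbool.h>

/* Toy model of the upstream behavior; names are made up. */
struct cpu_model {
	bool curr_is_fair; /* current task is SCHED_OTHER/fair */
	bool lazy_pending; /* TIF_NEED_RESCHED_LAZY analogue */
	bool need_resched; /* TIF_NEED_RESCHED analogue */
};

/* A wakeup that should preempt: a fair current task gets only the
 * lazy flag; note there is no critical-section check anywhere. */
static void wakeup_preempt_model(struct cpu_model *c)
{
	if (c->curr_is_fair)
		c->lazy_pending = true;
	else
		c->need_resched = true;
}

/* The tick upgrades a still-pending lazy flag to a hard resched. */
static void tick_model(struct cpu_model *c)
{
	if (c->lazy_pending)
		c->need_resched = true;
}
```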

> > Yes, I have worked on RT projects before --  you would know better
> > than anyone. :-D. But admittedly, I haven't got to work much with
> > PREEMPT_RT systems.
>
> Just using RT policy to improve performance is not an RT project. I'm
> talking about projects that if you miss a deadline things crash. Where the
> project works very hard to make sure everything works as intended.

No no no, I have done far more than just apply the RT policy, so you
do not know me that well ;-). I have worked on audio driver latency,
low-latency audio, latency issues in the vmalloc code, the preempt
tracers, irq tracepoints, wakeup latency tracers, and various kinds of
scheduler overhead debugging; many of those issues dealt with
sub-millisecond latency. I also dealt with CPU idle issues in the
hardware causing real-time latency problems (see my past talks if
interested). I was partly a hardware engineer when I started my
career and have built circuits. I have Electronics and Computer
Engineering degrees.

 - Joel
