linux-kernel - Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice

Message-ID: <20250210172059.07cda916@pumpkin>
Date: Mon, 10 Feb 2025 17:20:59 +0000
From: David Laight <david.laight.linux@...il.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Joel Fernandes <joel@...lfernandes.org>, Prakash Sangappa
 <prakash.sangappa@...cle.com>, Peter Zijlstra <peterz@...radead.org>,
 linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org, Thomas
 Gleixner <tglx@...utronix.de>, Ankur Arora <ankur.a.arora@...cle.com>,
 Linus Torvalds <torvalds@...ux-foundation.org>, linux-mm@...ck.org,
 x86@...nel.org, Andrew Morton <akpm@...ux-foundation.org>, luto@...nel.org,
 bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
 juri.lelli@...hat.com, vincent.guittot@...aro.org, willy@...radead.org,
 mgorman@...e.de, jon.grimm@....com, bharata@....com,
 raghavendra.kt@....com, Boris Ostrovsky <boris.ostrovsky@...cle.com>,
 Konrad Wilk <konrad.wilk@...cle.com>, jgross@...e.com,
 Andrew.Cooper3@...rix.com, Vineeth Pillai <vineethrp@...gle.com>, Suleiman
 Souhlal <suleiman@...gle.com>, Ingo Molnar <mingo@...nel.org>, Mathieu
 Desnoyers <mathieu.desnoyers@...icios.com>, Clark Williams
 <clark.williams@...il.com>, bigeasy@...utronix.de, daniel.wagner@...e.com,
 Joseph Salisbury <joseph.salisbury@...cle.com>, broonie@...il.com
Subject: Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice

On Thu, 6 Feb 2025 08:30:39 -0500
Steven Rostedt <rostedt@...dmis.org> wrote:

> On Wed, 5 Feb 2025 22:07:12 -0500
> Joel Fernandes <joel@...lfernandes.org> wrote:
> > >
> > > RT tasks don't have a time slice. They are affected by events. An external
> > > interrupt coming in, or a timer going off that states something is
> > > happening. Perhaps we could use this for SCHED_RR or maybe even
> > > SCHED_DEADLINE, as those do have time slices.
> > >
> > > But if it does get used, it should only be used when the task being
> > > scheduled is the same SCHED_RR priority, or if SCHED_DEADLINE will not fail
> > > its guarantees.
> > >    
> > 
> > Right, it would apply still to RR/DL though...  
> 
> But it would have to guarantee that the RR it is delaying is of the same
> priority, and that delaying the DL is not going to cause something to miss
> its deadline.
> 
> >   
> > > > In any case, if you want this to only work on FAIR tasks and not RT
> > > > tasks, why is that only possible to do with rseq() + LAZY preemption
> > > > and not Prakash's new API + all preemption modes?
> > > >
> > > > Also you can just ignore RT tasks (not that I'm saying that's a good
> > > > idea but..) in taskshrd_delay_resched() in that patch if you ever
> > > > wanted to do that.
> > > >
> > > > I just feel the RT latency thing is a non-issue AFAICS.    
> > >
> > > Have you worked on any RT projects before?    
> > 
> > Heh.. I think maybe you misunderstood my statement, I was mentioning
> > that I felt (similar to Peter I think) that NOT adopting this feature
> > generically for all tasks due to a concern of 50us latency maybe does
> > not make sense since poorly designed app / random hardware already
> > have this issue. I think the main concern discussed in this thread is
> > (and please CMIIW):  
> 
> We have code that has sub 100us latency and less. If some random user space
> application applies this, adding 50us (or even 20us) will break these. And
> this has nothing to do with poorly designed applications or hardware.
> 
> By adding this as a feature that works everywhere, you will break use cases
> that work today.

Hmmm... you lose big-time anyway.

All you need is a lot of network traffic to 'pinch' the process context until
the hardware interrupt, NAPI softint code and RCU softint code complete.
That can easily take several milliseconds.

We managed to get a trace of a SCHED_FIFO task being pre-empted by a
higher priority SCHED_FIFO task.
The chosen target cpu was active, running a worker thread.
That got interrupted and ran softint code for several milliseconds.
Other cpus became idle, but the scheduler rather expects to be able
to run RT threads on the cpu it chooses.

The same can happen if an RT thread grabs a mutex for a short time.
All it takes is a hardware interrupt and the mutex hold time goes
through the roof.
You don't need a context switch to hurt you.
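
One way to see this effect from userspace is to timestamp the critical
section itself. The following is only a sketch: the names ts_delta_ns(),
timed_critical_section() and worst_hold_ns are made up for illustration,
and it assumes CLOCK_MONOTONIC is good enough for the resolution needed.
Any interrupt or softint work that lands between the two clock_gettime()
calls shows up as extra hold time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int64_t worst_hold_ns;	/* only written while 'lock' is held */

/* Nanoseconds elapsed from 'a' to 'b'. */
static int64_t ts_delta_ns(struct timespec a, struct timespec b)
{
	return (int64_t)(b.tv_sec - a.tv_sec) * 1000000000LL +
	       (b.tv_nsec - a.tv_nsec);
}

/*
 * Run 'body' (may be NULL) under the mutex and return how long the
 * mutex was held. The hold time includes anything that interrupted
 * this thread while it owned the lock.
 */
static int64_t timed_critical_section(void (*body)(void))
{
	struct timespec t0, t1;
	int64_t held;
	int rc;

	rc = pthread_mutex_lock(&lock);
	assert(rc == 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);

	if (body)
		body();

	clock_gettime(CLOCK_MONOTONIC, &t1);
	held = ts_delta_ns(t0, t1);
	if (held > worst_hold_ns)
		worst_hold_ns = held;
	rc = pthread_mutex_unlock(&lock);
	assert(rc == 0);
	(void)rc;

	return held;
}
```

Logging worst_hold_ns under heavy network load would show whether the
hold time of a nominally short critical section is being stretched.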

The only userspace fix is to replace all the mutexes with atomic operations.
(And even they can be griefsome because they are measurably slow.)
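
As a minimal sketch of that kind of replacement, assuming the protected
data is a single counter (the names stat_locked, stat_atomic,
stat_add_locked() and stat_add_atomic() are hypothetical), the mutex
version can be swapped for a C11 atomic read-modify-write. This only
works when the protected state fits in one atomic object; anything
larger needs a different lock-free design.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static pthread_mutex_t stat_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t stat_locked;		/* protected by stat_lock */
static _Atomic uint64_t stat_atomic;	/* updated without a lock */

/*
 * Mutex version: if the holder is interrupted between lock and
 * unlock, every other thread waits for the whole interruption.
 */
static void stat_add_locked(uint64_t n)
{
	int rc;

	rc = pthread_mutex_lock(&stat_lock);
	assert(rc == 0);
	stat_locked += n;
	rc = pthread_mutex_unlock(&stat_lock);
	assert(rc == 0);
	(void)rc;
}

/*
 * Atomic version: a single read-modify-write, so there is no lock
 * hold time for an interrupt to stretch. Relaxed ordering is enough
 * for a plain statistic; stronger ordering is needed if other data
 * is published through this variable.
 */
static void stat_add_atomic(uint64_t n)
{
	atomic_fetch_add_explicit(&stat_atomic, n, memory_order_relaxed);
}
```

The atomic add is not free (it is typically a locked bus operation on
x86), which is the measurable cost mentioned above.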

	David