Message-ID: <20241101200704.GE689589@pauld.westford.csb>
Date: Fri, 1 Nov 2024 16:07:04 -0400
From: Phil Auld <pauld@...hat.com>
To: Mike Galbraith <efault@....de>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
	mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
	kprateek.nayak@....com, wuyun.abel@...edance.com,
	youssefesmat@...omium.org, tglx@...utronix.de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue


Hi Mike,

On Fri, Nov 01, 2024 at 07:08:31PM +0100 Mike Galbraith wrote:
> On Fri, 2024-11-01 at 10:42 -0400, Phil Auld wrote:
> > On Fri, Nov 01, 2024 at 03:26:49PM +0100 Peter Zijlstra wrote:
> > > On Fri, Nov 01, 2024 at 09:38:22AM -0400, Phil Auld wrote:
> > >
> > > > How is delay dequeue causing more preemption?
> > >
> > > The thing delay dequeue does is it keeps !eligible tasks on the runqueue
> > > until they're picked again. Them getting picked means they're eligible.
> > > If at that point they're still not runnable, they're dequeued.
> > >
> > > By keeping them around like this, they can earn back their lag.
> > >
> > > The result is that the moment they get woken up again, they're going to
> > > be eligible and are considered for preemption.
> > >
> > >
> > > The whole thinking behind this is that while 'lag' measures the
> > > amount of service difference from the ideal (positive lag will have
> > > had less service, while negative lag will have had too much service),
> > > this is only true for the (constantly) competing task.
> > >
> > > The moment a task leaves, will it still have had too much service? And
> > > after a few seconds of inactivity?
> > >
> > > So keeping the deactivated tasks (artificially) in the competition
> > > until they're at least at the equal service point lets them burn off
> > > some of that debt.
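
(To check my reading of the lag/eligibility rule above, here's a toy
userspace model -- just the arithmetic as I understand it, with made-up
task names, weights and vruntimes; the kernel's actual avg_vruntime()
bookkeeping differs in the details:

/*
 * Toy model: lag_i is the weighted distance of a task's vruntime from
 * the queue's load-weighted average V; a task is eligible when that
 * lag is >= 0.  Not kernel code.
 */
#include <stdio.h>

struct task {
	const char *name;
	double weight;		/* load weight */
	double vruntime;	/* virtual runtime consumed so far */
};

/* Load-weighted average vruntime V over the competing tasks. */
static double avg_vruntime(const struct task *t, int n)
{
	double wsum = 0.0, vsum = 0.0;

	for (int i = 0; i < n; i++) {
		wsum += t[i].weight;
		vsum += t[i].weight * t[i].vruntime;
	}
	return vsum / wsum;
}

int main(void)
{
	struct task rq[] = {
		{ "hog",     1024, 10.5 },	/* ran a lot: negative lag  */
		{ "steady",  1024, 10.0 },
		{ "sleeper", 1024,  9.2 },	/* ran little: positive lag */
	};
	double V = avg_vruntime(rq, 3);

	for (int i = 0; i < 3; i++) {
		double lag = rq[i].weight * (V - rq[i].vruntime);

		printf("%-8s lag=%8.1f eligible=%d\n",
		       rq[i].name, lag, rq[i].vruntime <= V);
	}
	return 0;
}

So the sleeper is the only one delay_dequeue would let go immediately;
the hog would be parked on the rq until it earns its way back to V.)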
> > >
> > > It is not dissimilar to how CFS had sleeper bonus, except that was
> > > walltime based, while this is competition based.
> > >
> > >
> > > Notably, this makes a significant difference for interactive tasks that
> > > only run periodically. If they're not eligible at the point of wakeup,
> > > they'll incur undue latency.
> > >
> > >
> > > Now, I imagine FIO to have tasks blocking on IO, and while they're
> > > blocked, they'll be earning their eligibility, such that when they're
> > > woken they're good to go, preempting whatever.
> > >
> > > Whatever doesn't seem to enjoy this.
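
(And the knock-on effect at wakeup, as I understand it, is roughly the
sketch below -- again a userspace illustration with invented numbers,
not the kernel's actual wakeup-preemption path: a freshly woken task
gets to preempt when it's eligible and its virtual deadline comes
before the current task's.

#include <stdbool.h>
#include <stdio.h>

struct se {
	double vruntime;	/* virtual runtime */
	double deadline;	/* virtual deadline */
};

static bool eligible(const struct se *se, double V)
{
	return se->vruntime <= V;	/* non-negative lag */
}

static bool wakeup_preempts(const struct se *woken,
			    const struct se *curr, double V)
{
	return eligible(woken, V) && woken->deadline < curr->deadline;
}

int main(void)
{
	double V = 10.0;			/* queue's average vruntime   */
	struct se fio  = {  9.5, 10.3 };	/* earned eligibility blocked */
	struct se curr = { 10.2, 11.0 };	/* whoever was running        */

	printf("woken fio thread preempts: %d\n",
	       wakeup_preempts(&fio, &curr, V));
	return 0;
}

With delay_dequeue, the blocked FIO thread shows up on the eligible
side of that test as soon as it wakes, hence the extra preemption.)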
> > >
> > >
> > > Given BATCH makes such a terrible mess of things, I'm thinking FIO as a
> > > whole does like preemption -- so now it's a question of figuring out
> > > what exactly it does and doesn't like. Which is never trivial :/
> > >
> >
> > Thanks for that detailed explanation.
> >
> > I can confirm that FIO does like the preemption
> >
> > NO_WAKEUP_P and DELAY    - 427 MB/s
> > NO_WAKEUP_P and NO_DELAY - 498 MB/s
> > WAKEUP_P and DELAY       - 519 MB/s
> > WAKEUP_P and NO_DELAY    - 590 MB/s
> >
> > Something in the delay itself
> > (extra tasks in the queue? not migrating the delayed task? ...)
> 
> I think it's all about short term fairness and asymmetric buddies.

Thanks for jumping in.  My jargon decoder ring seems to be failing me
so I'm not completely sure what you are saying below :)

"buddies" you mean tasks that waking each other up and sleeping.
And one runs for longer than the other, right?
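Something like this minimal pipe ping-pong pair is what I picture -- two
processes that alternately wake each other and otherwise sleep, one doing
more work per round than the other (illustration only, not tbench itself;
the iteration counts are arbitrary):

#include <stdio.h>
#include <unistd.h>

static void spin(long iters)		/* stand-in for per-wakeup work */
{
	volatile long x = 0;

	for (long i = 0; i < iters; i++)
		x += i;
}

int main(void)
{
	int a2b[2], b2a[2];
	char c = 'x';

	if (pipe(a2b) || pipe(b2a)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {			/* light buddy ("server") */
		close(a2b[1]);
		close(b2a[0]);
		while (read(a2b[0], &c, 1) == 1) {
			spin(10000);		/* little work per wakeup */
			if (write(b2a[1], &c, 1) != 1)
				break;
		}
		_exit(0);
	}

	close(a2b[0]);
	close(b2a[1]);
	for (int i = 0; i < 100000; i++) {	/* wide buddy ("client") */
		spin(100000);			/* more work per round */
		if (write(a2b[1], &c, 1) != 1)	/* wake the server ...  */
			break;
		if (read(b2a[0], &c, 1) != 1)	/* ... sleep until woken */
			break;
	}
	return 0;		/* closing the pipes ends the child */
}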

> 
> tbench comparison, eevdf vs cfs, 100% apples to apples.
> 
> 1 tbench buddy pair scheduled cross core.
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
> 13770 root      20   0   21424   1920   1792 S 60.13 0.012   0:33.81 3 tbench
> 13771 root      20   0    4720    896    768 S 46.84 0.006   0:26.10 2 tbench_srv
 
> Note the 60/47 utilization. Now the same pair, pinned/stacked:
> 
> 6.1.114-cfs
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
>  4407 root      20   0   21424   1980   1772 R 50.00 0.012   0:29.20 3 tbench
>  4408 root      20   0    4720    124      0 R 50.00 0.001   0:28.76 3 tbench_srv

What is the difference between these first two?  The first is on
separate cores so they don't interfere with each other? And the second is
pinned to the same core?

>
> Note what happens to the lighter tbench_srv. Consuming red hot L2 data,
> it can utilize a full 50%, but it must first preempt wide bottom buddy.
>

We've got "light" and "wide" here which is a bit mixed metaphorically :) 
So here CFS is letting the wakee preempt the waker and providing pretty
equal fairness. And hot l2 caching is masking the assymmetry. 

> Now eevdf.  (zero source deltas other than eevdf)
> 6.1.114-eevdf -delay_dequeue
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
>  4988 root      20   0   21424   1948   1736 R 56.44 0.012   0:32.92 3 tbench
>  4989 root      20   0    4720    128      0 R 44.55 0.001   0:25.49 3 tbench_srv
> 6.1.114-eevdf +delay_dequeue
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
>  4934 root      20   0   21424   1952   1736 R 52.00 0.012   0:30.09 3 tbench
>  4935 root      20   0    4720    124      0 R 49.00 0.001   0:28.15 3 tbench_srv
> 
> As Peter noted, delay_dequeue levels the sleeper playing field.  Both
> of these guys are 1:1 sleepers, but they're asymmetric in width.

With wakeup preemption off it doesn't help in my case. I was thinking
maybe the preemption was preventing some batching of IO completions or
initiations, but that seems to have been wrong.

Does it also possibly make wakeup migration less likely and thus increase
stacking?  

> Bottom line, box full of 1:1 buddies pairing up and stacking in L2.
> 
> tbench 8
> 6.1.114-cfs      3674.37 MB/sec
> 6.1.114-eevdf    3505.25 MB/sec -delay_dequeue
>                  3701.66 MB/sec +delay_dequeue
> 
> For tbench, preemption = shorter turnaround = higher throughput.

So here you have a benchmark that gets a ~5% boost from delayed_dequeue.

But I've got one that gets a 20% penalty, so I'm not exactly sure what
to make of that. Clearly FIO does not have the same pattern as tbench.

It's not a special case, though; it's one that our perf team runs
regularly to look for regressions.

I'll be able to poke at it more next week so hopefully I can see what it's
doing. 


Cheers,
Phil


> 
> 	-Mike
> 

-- 

