Message-ID: <20241101200704.GE689589@pauld.westford.csb>
Date: Fri, 1 Nov 2024 16:07:04 -0400
From: Phil Auld <pauld@...hat.com>
To: Mike Galbraith <efault@....de>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
kprateek.nayak@....com, wuyun.abel@...edance.com,
youssefesmat@...omium.org, tglx@...utronix.de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
Hi Mike,
On Fri, Nov 01, 2024 at 07:08:31PM +0100 Mike Galbraith wrote:
> On Fri, 2024-11-01 at 10:42 -0400, Phil Auld wrote:
> > On Fri, Nov 01, 2024 at 03:26:49PM +0100 Peter Zijlstra wrote:
> > > On Fri, Nov 01, 2024 at 09:38:22AM -0400, Phil Auld wrote:
> > >
> > > > How is delay dequeue causing more preemption?
> > >
> > > The thing delay dequeue does is it keeps !eligible tasks on the runqueue
> > > until they're picked again. Them getting picked means they're eligible.
> > > If at that point they're still not runnable, they're dequeued.
> > >
> > > By keeping them around like this, they can earn back their lag.
> > >
> > > The result is that the moment they get woken up again, they're going to
> > > be eligible and are considered for preemption.
> > >
> > >
> > > The whole thinking behind this is that while 'lag' measures the
> > > amount of service difference from the ideal (positive lag means it has
> > > had less service, while negative lag means it has had too much), this
> > > is only true for a (constantly) competing task.
> > >
> > > The moment a task leaves, will it still have had too much service? And
> > > after a few seconds of inactivity?
> > >
> > > So keeping the deactivated tasks (artificially) in the competition
> > > until they're at least at the equal service point lets them burn off
> > > some of that debt.
> > >
> > > It is not dissimilar to how CFS had sleeper bonus, except that was
> > > walltime based, while this is competition based.
> > >
> > >
> > > Notably, this makes a significant difference for interactive tasks that
> > > only run periodically. If they're not eligible at the point of wakeup,
> > > they'll incur undue latency.
> > >
> > >
> > > Now, I imagine FIO to have tasks blocking on IO, and while they're
> > > blocked, they'll be earning their eligibility, such that when they're
> > > woken they're good to go, preempting whatever.
> > >
> > > Whatever doesn't seem to enjoy this.
> > >
> > >
> > > Given BATCH makes such a terrible mess of things, I'm thinking FIO as a
> > > whole does like preemption -- so now it's a question of figuring out
> > > what exactly it does and doesn't like. Which is never trivial :/
> > >
> >
> > Thanks for that detailed explanation.
> >
> > I can confirm that FIO does like the preemption:
> >
> > NO_WAKEUP_P and DELAY - 427 MB/s
> > NO_WAKEUP_P and NO_DELAY - 498 MB/s
> > WAKEUP_P and DELAY - 519 MB/s
> > WAKEUP_P and NO_DELAY - 590 MB/s
> >
> > Something in the delay itself
> > (extra tasks in the queue? not migrating the delayed task? ...)
>
> I think it's all about short term fairness and asymmetric buddies.
Thanks for jumping in. My jargon decoder ring seems to be failing me
so I'm not completely sure what you are saying below :)
"buddies" you mean tasks that waking each other up and sleeping.
And one runs for longer than the other, right?
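
In my head it's something like the toy pair below -- two processes
ping-ponging a byte over pipes, one burning noticeably more CPU per
round than the other. Just an illustration I cooked up to check my
understanding, not what tbench actually does:

/*
 * Toy 1:1 "buddy pair": two processes ping-pong a byte over a pair of
 * pipes, one doing noticeably more work per round than the other, so
 * they wake each other up and sleep with asymmetric CPU demand.
 */
#include <unistd.h>

static void burn(unsigned long loops)
{
	volatile unsigned long sink = 0;

	while (loops--)
		sink += loops;
}

int main(void)
{
	int to_srv[2], to_clt[2];
	char byte = 0;

	if (pipe(to_srv) || pipe(to_clt))
		return 1;

	if (fork() == 0) {
		/* "server": light work per round, like tbench_srv */
		close(to_srv[1]);
		close(to_clt[0]);
		for (;;) {
			if (read(to_srv[0], &byte, 1) != 1)
				_exit(0);
			burn(10000);
			write(to_clt[1], &byte, 1);
		}
	}

	/* "client": heavier work per round, like tbench */
	close(to_srv[0]);
	close(to_clt[1]);
	for (int i = 0; i < 1000000; i++) {
		burn(50000);
		write(to_srv[1], &byte, 1);
		if (read(to_clt[0], &byte, 1) != 1)
			break;
	}
	return 0;
}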
>
> tbench comparison eevdf vs cfs, 100% apple to apple.
>
> 1 tbench buddy pair scheduled cross core.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 13770 root 20 0 21424 1920 1792 S 60.13 0.012 0:33.81 3 tbench
> 13771 root 20 0 4720 896 768 S 46.84 0.006 0:26.10 2 tbench_srv
> Note 60/47 utilization. Now pinned/stacked:
>
> 6.1.114-cfs
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 4407 root 20 0 21424 1980 1772 R 50.00 0.012 0:29.20 3 tbench
> 4408 root 20 0 4720 124 0 R 50.00 0.001 0:28.76 3 tbench_srv
What is the difference between the first two? Is the first on separate
cores so they don't interfere with each other, and the second pinned to
the same core?
>
> Note what happens to the lighter tbench_srv. Consuming red hot L2 data,
> it can utilize a full 50%, but it must first preempt wide bottom buddy.
>
We've got "light" and "wide" here, which is a bit of a mixed metaphor :)
So here CFS is letting the wakee preempt the waker and providing pretty
equal fairness, and the hot L2 cache is masking the asymmetry.
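
If I remember the CFS side right, that wakeup decision boils down to
roughly the sketch below (simplified, from memory, with made-up names --
not the actual kernel code): the wakee preempts current only when it is
ahead in vruntime by more than a granularity scaled to the wakee's
weight, which the lighter tbench_srv usually is.

#include <stdbool.h>
#include <stdint.h>

struct se_sketch {
	uint64_t vruntime;	/* virtual runtime consumed so far */
	unsigned long weight;	/* load weight from nice level */
};

#define WAKEUP_GRAN_NS	1000000ULL	/* ~1ms, stand-in for the sysctl */
#define NICE0_WEIGHT	1024UL

/* granularity expressed in the wakee's virtual time */
static uint64_t wakeup_gran(const struct se_sketch *wakee)
{
	return WAKEUP_GRAN_NS * NICE0_WEIGHT / wakee->weight;
}

/* true if the freshly woken entity should preempt the running one */
static bool wakeup_preempt_sketch(const struct se_sketch *curr,
				  const struct se_sketch *wakee)
{
	int64_t vdiff = (int64_t)(curr->vruntime - wakee->vruntime);

	return vdiff > (int64_t)wakeup_gran(wakee);
}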
> Now eevdf. (zero source deltas other than eevdf)
> 6.1.114-eevdf -delay_dequeue
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 4988 root 20 0 21424 1948 1736 R 56.44 0.012 0:32.92 3 tbench
> 4989 root 20 0 4720 128 0 R 44.55 0.001 0:25.49 3 tbench_srv
> 6.1.114-eevdf +delay_dequeue
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 4934 root 20 0 21424 1952 1736 R 52.00 0.012 0:30.09 3 tbench
> 4935 root 20 0 4720 124 0 R 49.00 0.001 0:28.15 3 tbench_srv
>
> As Peter noted, delay_dequeue levels the sleeper playing field. Both
> of these guys are 1:1 sleepers, but they're asymmetric in width.
With wakeup preemption off it doesn't help in my case. I was thinking
maybe the preemption was preventing some batching of IO completions or
initiations, but that seems to have been wrong.
Does it also possibly make wakeup migration less likely and thus increase
stacking?
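
What I'm imagining is roughly the following (a simplified sketch of my
reading of the wakeup path, hypothetical names, not the actual code):

struct task_sketch {
	int on_rq;		/* still accounted on a runqueue */
	int sched_delayed;	/* dequeue deferred by DELAY_DEQUEUE */
	int cpu;		/* rq the task is (still) queued on */
};

/* stand-in for the normal wake-time CPU selection / migration */
static int select_task_rq_sketch(struct task_sketch *p)
{
	return p->cpu;	/* placeholder: the real code may pick another CPU */
}

/* returns the CPU the task will run on after the wakeup */
static int wake_up_sketch(struct task_sketch *p)
{
	if (p->on_rq) {
		/*
		 * Delayed tasks land here: clear the delayed state and
		 * requeue in place on the same CPU -- no migration.
		 */
		p->sched_delayed = 0;
		return p->cpu;
	}

	return select_task_rq_sketch(p);
}

If that's right, a delayed wakee always comes back on the CPU it was
parked on, which would line up with more stacking.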
> Bottom line, box full of 1:1 buddies pairing up and stacking in L2.
>
> tbench 8
> 6.1.114-cfs 3674.37 MB/sec
> 6.1.114-eevdf 3505.25 MB/sec -delay_dequeue
> 3701.66 MB/sec +delay_dequeue
>
> For tbench, preemption = shorter turnaround = higher throughput.
So here you have a benchmark that gets a ~5% boost from delay_dequeue,
but I've got one that gets a 20% penalty, so I'm not exactly sure what
to make of that. Clearly FIO does not have the same pattern as tbench.
It's not a special case though; it's one our perf team runs regularly
to look for regressions.
I'll be able to poke at it more next week so hopefully I can see what it's
doing.
Cheers,
Phil
>
> -Mike
>