Message-ID: <59355fae66255a92f2cbc4d7ed38368ff3565140.camel@gmx.de>
Date: Sat, 02 Nov 2024 05:32:14 +0100
From: Mike Galbraith <efault@....de>
To: Phil Auld <pauld@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com, 
 juri.lelli@...hat.com, vincent.guittot@...aro.org,
 dietmar.eggemann@....com,  rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com,  linux-kernel@...r.kernel.org,
 kprateek.nayak@....com, wuyun.abel@...edance.com, 
 youssefesmat@...omium.org, tglx@...utronix.de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

On Fri, 2024-11-01 at 16:07 -0400, Phil Auld wrote:


> Thanks for jumping in.  My jargon decoder ring seems to be failing me
> so I'm not completely sure what you are saying below :)
>
> "buddies" you mean tasks that waking each other up and sleeping.
> And one runs for longer than the other, right?

Yeah, buddies are related waker/wakee pairs, 1:1, 1:N or M:N, excluding
tasks that just happen to be sitting on a CPU where, say, a timer fires,
an IRQ leads to a wakeup of lord knows what, lock wakeups etc etc etc. I
think Peter coined the term buddy to mean that (less typing), and it stuck.
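
Purely for illustration (hypothetical code, not tbench itself), a
bare-bones 1:1 buddy pair looks something like the below: parent and
child ping-pong one byte over a pair of pipes, so every wakeup of
either task comes from its buddy.

	/* Minimal sketch of a 1:1 waker/wakee buddy pair (illustrative
	 * only): each task sleeps in read() and is woken by the other. */
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int to_child[2], to_parent[2];
		char byte = 0;

		if (pipe(to_child) || pipe(to_parent)) {
			perror("pipe");
			return 1;
		}

		if (fork() == 0) {
			/* wakee: woken by its buddy, echoes the byte back */
			for (;;) {
				if (read(to_child[0], &byte, 1) != 1)
					_exit(0);
				write(to_parent[1], &byte, 1);
			}
		}

		/* waker: drives the exchange, sleeping in read() between rounds */
		for (int i = 0; i < 1000000; i++) {
			write(to_child[1], &byte, 1);
			read(to_parent[0], &byte, 1);
		}
		return 0;
	}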

> > 1 tbench buddy pair scheduled cross core.
> >
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
> > 13770 root      20   0   21424   1920   1792 S 60.13 0.012   0:33.81 3 tbench
> > 13771 root      20   0    4720    896    768 S 46.84 0.006   0:26.10 2 tbench_srv
>  
> > Note 60/47 utilization, now pinned/stacked.
> >
> > 6.1.114-cfs
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
> >  4407 root      20   0   21424   1980   1772 R 50.00 0.012   0:29.20 3 tbench
> >  4408 root      20   0    4720    124      0 R 50.00 0.001   0:28.76 3 tbench_srv
>
> What is the difference between these first two?  The first is on
> separate cores so they don't interfere with each other? And the second is
> pinned to the same core?

Yeah, see the 'P' (last used CPU) column. Given CPU headroom, a tbench
pair can consume ~107%; they're not fully synchronous.. which wouldn't
be relevant here/now if they were :)

> > Note what happens to the lighter tbench_srv. Consuming red hot L2 data,
> > it can utilize a full 50%, but it must first preempt its wide bottom buddy.
> >
>
> We've got "light" and "wide" here which is a bit mixed metaphorically
> :)

Wide, skinny, feather-weight or lard-ball, they all work for me.

> So here CFS is letting the wakee preempt the waker and providing pretty
> equal fairness. And hot L2 caching is masking the asymmetry.

No, it's way simpler: preemption slices through the only thing it can
slice through, the post-wakeup concurrent bits.. which otherwise sit
directly in the communication stream as a lump of latency in a
latency-bound operation.

>
> With wakeup preemption off it doesn't help in my case. I was thinking
> maybe the preemption was preventing some batching of IO completions or
> initiations. But that was wrong, it seems.

Dunno.

> Does it also possibly make wakeup migration less likely and thus increase
> stacking?

The buddy being preempted certainly won't be wakeup migrated, because
it won't sleep. Two very sleepy tasks when bandwidth constrained become
one 100% hog and one 99.99% hog when CPU constrained.

> > Bottom line, box full of 1:1 buddies pairing up and stacking in L2.
> >
> > tbench 8
> > 6.1.114-cfs      3674.37 MB/sec
> > 6.1.114-eevdf    3505.25 MB/sec -delay_dequeue
> >                  3701.66 MB/sec +delay_dequeue
> >
> > For tbench, preemption = shorter turnaround = higher throughput.
>
> So here you have a benchmark that gets a ~5% boost from
> delay_dequeue.
>
> But I've got one that gets a 20% penalty, so I'm not exactly sure what
> to make of that. Clearly FIO does not have the same pattern as tbench.

There are basically two options in sched-land: shave fastpath cycles,
or some variant of rob Peter to pay Paul ;-)

	-Mike
