Message-ID: <20241104130515.GB749675@pauld.westford.csb>
Date: Mon, 4 Nov 2024 08:05:15 -0500
From: Phil Auld <pauld@...hat.com>
To: Mike Galbraith <efault@....de>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
kprateek.nayak@....com, wuyun.abel@...edance.com,
youssefesmat@...omium.org, tglx@...utronix.de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
On Sat, Nov 02, 2024 at 05:32:14AM +0100 Mike Galbraith wrote:
> On Fri, 2024-11-01 at 16:07 -0400, Phil Auld wrote:
>
>
> > Thanks for jumping in. My jargon decoder ring seems to be failing me
> > so I'm not completely sure what you are saying below :)
> >
> > "buddies" you mean tasks that waking each other up and sleeping.
> > And one runs for longer than the other, right?
>
> Yeah, buddies are related waker/wakee, 1:1, 1:N or M:N, excluding tasks
> that happen to be sitting on a CPU where, say, a timer fires, an IRQ leads
> to a wakeup of lord knows what, lock wakeups etc etc etc. I think Peter
> coined the term buddy to mean that (less typing), and it stuck.
>
Thanks!
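
For the archives, here's a minimal user-space sketch of a 1:1 buddy
pair, using a pipe-based ping-pong as a stand-in for tbench's socket
traffic (illustrative only, not tbench itself):

/*
 * 1:1 buddies: two tasks that do little but wake each other in turn.
 * The child plays tbench_srv (read, reply); the parent plays tbench
 * (send, do the wider share of the work, wait).
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int ping[2], pong[2];
	char buf = 0;

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		for (;;) {			/* wakee: skinny buddy */
			read(ping[0], &buf, 1);
			write(pong[1], &buf, 1);
		}
	}

	for (;;) {				/* waker: wide buddy */
		write(ping[1], &buf, 1);
		/* heavier post-wakeup processing lives here */
		read(pong[0], &buf, 1);
	}
}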
> > > 1 tbench buddy pair scheduled cross core.
> > >
> > >   PID USER  PR  NI  VIRT  RES  SHR S  %CPU  %MEM    TIME+ P COMMAND
> > > 13770 root  20   0 21424 1920 1792 S 60.13 0.012  0:33.81 3 tbench
> > > 13771 root  20   0  4720  896  768 S 46.84 0.006  0:26.10 2 tbench_srv
> >
> > > Note 60/47 utilization, now pinned/stacked.
> > >
> > > 6.1.114-cfs
> > >  PID USER  PR  NI  VIRT  RES  SHR S  %CPU  %MEM    TIME+ P COMMAND
> > > 4407 root  20   0 21424 1980 1772 R 50.00 0.012  0:29.20 3 tbench
> > > 4408 root  20   0  4720  124    0 R 50.00 0.001  0:28.76 3 tbench_srv
> >
> > What is the difference between these first two? The first is on
> > separate cores so they don't interfere with each other? And the second is
> > pinned to the same core?
>
> Yeah, see 'P'. Given CPU headroom, a tbench pair can consume ~107%.
> They're not fully synchronous.. this wouldn't be relevant here/now if
> they were :)
>
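
(Aside for anyone reproducing the stacked runs: pinning both tasks to
one CPU can be done with taskset, or programmatically with something
like the sketch below; CPU 3 is just the example from the top output
above:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling task to a single CPU (what taskset -c does). */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);	/* 0 == self */
}

int main(void)
{
	if (pin_to_cpu(3)) {		/* CPU 3, as in the pinned run */
		perror("sched_setaffinity");
		return 1;
	}
	/* exec or run the workload here */
	return 0;
}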
> > > Note what happens to the lighter tbench_srv. Consuming red hot L2 data,
> > > it can utilize a full 50%, but it must first preempt its wide-bottom buddy.
> > >
> >
> > We've got "light" and "wide" here, which is a bit mixed metaphorically
> > :)
>
> Wide, skinny, feather-weight or lard-ball, they all work for me.
>
> > So here CFS is letting the wakee preempt the waker and providing pretty
> > equal fairness. And hot L2 caching is masking the asymmetry.
>
> No, it's way simpler: preemption slices through the only thing it can
> slice through, the post-wakeup concurrent bits.. which otherwise sit
> directly in the communication stream as a lump of latency in a
> latency-bound operation.
>
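(For readers without the source handy, a toy model of the wakeup
preemption test being discussed; "lag" and "deadline" here are
stand-ins for the scheduler's per-entity bookkeeping, and the real
logic in check_preempt_wakeup_fair()/pick_eevdf() has more conditions:)

#include <stdbool.h>

/*
 * Toy model, not kernel code: an entity is eligible when it isn't
 * overdrawn on CPU service (non-negative lag), and a wakeup preempts
 * when the eligible wakee's virtual deadline is earlier than the
 * current task's.  That preemption is what cuts the waker's
 * post-wakeup processing out of the round trip.
 */
struct entity {
	long long lag;			/* +ve: owed service, -ve: overdrawn */
	unsigned long long deadline;	/* virtual deadline */
};

static bool eligible(const struct entity *se)
{
	return se->lag >= 0;
}

static bool wakeup_preempts(const struct entity *curr,
			    const struct entity *wakee)
{
	return eligible(wakee) && wakee->deadline < curr->deadline;
}
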
> >
> > With wakeup preemption off it doesn't help in my case. I was thinking
> > maybe the preemption was preventing some batching of IO completions
> > or initiations. But that was wrong, it seems.
>
> Dunno.
>
> > Does it also possibly make wakeup migration less likely and thus increase
> > stacking?
>
> The buddy being preempted certainly won't be wakeup migrated, because
> it won't sleep. Two very sleepy tasks when bandwidth constrained become
> one 100% hog and one 99.99% hog when CPU constrained.
>
Not the waker, who gets preempted, but the wakee may be a bit more
sticky on his current CPU, and thus stack more, since he's still
in that runqueue. But that's just a mental exercise, trying to
find things that are directly related to delay dequeue. No observation
other than the overall perf hit.
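
Roughly what I mean, paraphrased from the try_to_wake_up() fast path
(heavily elided; see kernel/sched/core.c for the real thing):

	/*
	 * A task still on a runqueue is woken in place and never reaches
	 * the placement code below.  With DELAY_DEQUEUE a "sleeping"
	 * wakee can still be p->on_rq, so its wakeup takes this early
	 * exit and select_task_rq() is never consulted.
	 */
	if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
		goto out;

	/* only fully dequeued tasks get a fresh placement decision */
	cpu = select_task_rq(p, ...);
	ttwu_queue(p, cpu, wake_flags);

So a delay-dequeued wakee stays put on its old runqueue, where it may
stack behind the buddy it would otherwise have escaped.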
> > > Bottom line, box full of 1:1 buddies pairing up and stacking in L2.
> > >
> > > tbench 8
> > > 6.1.114-cfs 3674.37 MB/sec
> > > 6.1.114-eevdf 3505.25 MB/sec -delay_dequeue
> > > 3701.66 MB/sec +delay_dequeue
> > >
> > > For tbench, preemption = shorter turnaround = higher throughput.
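
(For anyone wanting to flip this at runtime: the ±delay_dequeue runs
above should correspond to the DELAY_DEQUEUE sched feature, i.e.
something like "echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features"
to disable it, assuming debugfs is mounted in the usual place.)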
> >
> > So here you have a benchmark that gets a ~5% boost from
> > delayed_dequeue.
> >
> > But I've got one that gets a 20% penalty, so I'm not exactly sure what
> > to make of that. Clearly FIO does not have the same pattern as tbench.
>
> There are basically two options in sched-land: shave fastpath cycles,
> or some variant of Rob Peter to pay Paul ;-)
>
That Peter is cranky :)
Cheers,
Phil
> -Mike
>