linux-kernel - Re: [PATCH 17/24] sched/fair: Implement delayed dequeue


Message-ID: <a59a1a99b7807d9937e424881c262ba7476d8b6b.camel@gmx.de>
Date: Fri, 01 Nov 2024 19:08:31 +0100
From: Mike Galbraith <efault@....de>
To: Phil Auld <pauld@...hat.com>, Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org, 
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de,  vschneid@...hat.com, linux-kernel@...r.kernel.org,
 kprateek.nayak@....com,  wuyun.abel@...edance.com,
 youssefesmat@...omium.org, tglx@...utronix.de
Subject: Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

On Fri, 2024-11-01 at 10:42 -0400, Phil Auld wrote:
> On Fri, Nov 01, 2024 at 03:26:49PM +0100 Peter Zijlstra wrote:
> > On Fri, Nov 01, 2024 at 09:38:22AM -0400, Phil Auld wrote:
> >
> > > How is delay dequeue causing more preemption?
> >
> > The thing delay dequeue does is it keeps !eligible tasks on the runqueue
> > until they're picked again. Them getting picked means they're eligible.
> > If at that point they're still not runnable, they're dequeued.
> >
> > By keeping them around like this, they can earn back their lag.
> >
> > The result is that the moment they get woken up again, they're going to
> > be eligible and are considered for preemption.
> >
> >
> > The whole thinking behind this is that while 'lag' measures the
> > amount of service difference from the ideal (positive lag will have had less
> > service, while negative lag will have had too much service), this is
> > only true for the (constantly) competing task.
> >
> > The moment a task leaves, will it still have had too much service? And
> > after a few seconds of inactivity?
> >
> > So keeping the deactivated tasks (artificially) in the competition
> > until they're at least at the equal service point lets them burn off
> > some of that debt.
> >
> > It is not dissimilar to how CFS had sleeper bonus, except that was
> > walltime based, while this is competition based.
> >
> >
> > Notably, this makes a significant difference for interactive tasks that
> > only run periodically. If they're not eligible at the point of wakeup,
> > they'll incur undue latency.
> >
> >
> > Now, I imagine FIO to have tasks blocking on IO, and while they're
> > blocked, they'll be earning their eligibility, such that when they're
> > woken they're good to go, preempting whatever.
> >
> > Whatever doesn't seem to enjoy this.
> >
> >
> > Given BATCH makes such a terrible mess of things, I'm thinking FIO as a
> > whole does like preemption -- so now it's a question of figuring out
> > what exactly it does and doesn't like. Which is never trivial :/
> >
>
> Thanks for that detailed explanation.
>
> I can confirm that FIO does like the preemption
>
> NO_WAKEUP_P and DELAY    - 427 MB/s
> NO_WAKEUP_P and NO_DELAY - 498 MB/s
> WAKEUP_P and DELAY       - 519 MB/s
> WAKEUP_P and NO_DELAY    - 590 MB/s
>
> Something in the delay itself
> (extra tasks in the queue? not migrating the delayed task? ...)
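
A quick way to read Phil's four numbers is to separate the two features. The sketch below only does arithmetic on the figures quoted above; the labels are Phil's abbreviations and the helper names are made up for illustration.

```python
# Throughput in MB/s, copied from the quoted table.
results = {
    ("NO_WAKEUP_P", "DELAY"): 427,
    ("NO_WAKEUP_P", "NO_DELAY"): 498,
    ("WAKEUP_P", "DELAY"): 519,
    ("WAKEUP_P", "NO_DELAY"): 590,
}

def delay_cost(wakeup):
    """MB/s lost by turning DELAY on, for a fixed wakeup-preemption setting."""
    return results[(wakeup, "NO_DELAY")] - results[(wakeup, "DELAY")]

def wakeup_gain(delay):
    """MB/s gained by turning WAKEUP_P on, for a fixed delay setting."""
    return results[("WAKEUP_P", delay)] - results[("NO_WAKEUP_P", delay)]

for wakeup in ("NO_WAKEUP_P", "WAKEUP_P"):
    print(f"{wakeup:12s}: DELAY costs {delay_cost(wakeup)} MB/s")
for delay in ("DELAY", "NO_DELAY"):
    print(f"{delay:12s}: WAKEUP_P gains {wakeup_gain(delay)} MB/s")
```

In this one data set the two effects come out the same regardless of the other setting, which is consistent with Phil's reading that the cost sits in the delay itself rather than in the extra preemption. It is a single run per configuration, so treat that as a hint and not a measurement of independence.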

I think it's all about short term fairness and asymmetric buddies.

tbench comparison, eevdf vs cfs, 100% apples to apples.

1 tbench buddy pair scheduled cross core.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
13770 root      20   0   21424   1920   1792 S 60.13 0.012   0:33.81 3 tbench
13771 root      20   0    4720    896    768 S 46.84 0.006   0:26.10 2 tbench_srv

Note the 60/47 utilization.  Now the same pair pinned/stacked on one CPU.

6.1.114-cfs
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
 4407 root      20   0   21424   1980   1772 R 50.00 0.012   0:29.20 3 tbench
 4408 root      20   0    4720    124      0 R 50.00 0.001   0:28.76 3 tbench_srv

Note what happens to the lighter tbench_srv. Consuming red hot L2 data,
it can utilize a full 50%, but it must first preempt wide bottom buddy.

Now eevdf.  (zero source deltas other than eevdf)
6.1.114-eevdf -delay_dequeue
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
 4988 root      20   0   21424   1948   1736 R 56.44 0.012   0:32.92 3 tbench
 4989 root      20   0    4720    128      0 R 44.55 0.001   0:25.49 3 tbench_srv
6.1.114-eevdf +delay_dequeue
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
 4934 root      20   0   21424   1952   1736 R 52.00 0.012   0:30.09 3 tbench
 4935 root      20   0    4720    124      0 R 49.00 0.001   0:28.15 3 tbench_srv

As Peter noted, delay_dequeue levels the sleeper playing field.  Both
of these guys are 1:1 sleepers, but they're asymmetric in width.
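
For readers who skipped the quoted explanation, here is a toy model of what Peter describes: a task that blocks with negative lag stays on the queue, and earns its way back to eligibility as the competition runs. This is a deliberately simplified sketch and not the kernel code. It assumes lag is just the weighted average vruntime minus the task's vruntime, ignores deadlines and slices, and every name in it (Task, ticks_until_eligible, ...) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    vruntime: float          # service received so far, in virtual time
    weight: float = 1.0
    runnable: bool = True    # False: blocked, but kept queued (delayed dequeue)

def avg_vruntime(queue):
    """Weighted average vruntime of everything still on the queue."""
    total_weight = sum(t.weight for t in queue)
    return sum(t.weight * t.vruntime for t in queue) / total_weight

def lag(task, queue):
    """Positive: task got less service than the ideal. Negative: too much."""
    return avg_vruntime(queue) - task.vruntime

def eligible(task, queue):
    return lag(task, queue) >= 0

def ticks_until_eligible(sleeper, runner, queue, tick=1.0):
    """Let runner consume CPU until the queued sleeper becomes eligible."""
    ticks = 0
    while not eligible(sleeper, queue):
        runner.vruntime += tick / runner.weight
        ticks += 1
    return ticks

# The sleeper blocks after overrunning; the hog keeps competing.
sleeper = Task("sleeper", vruntime=10.0, runnable=False)
hog = Task("hog", vruntime=4.0)
queue = [sleeper, hog]

lag_at_sleep = lag(sleeper, queue)
ticks = ticks_until_eligible(sleeper, hog, queue)
lag_after = lag(sleeper, queue)
print(f"lag at sleep {lag_at_sleep}, eligible after {ticks} ticks, lag now {lag_after}")
```

Without the delay, the sleeper in this model would leave the queue carrying its negative lag and come back ineligible. With it, the debt is burned off while blocked, so the wakeup finds the task eligible and it is considered for preemption right away.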

Bottom line, box full of 1:1 buddies pairing up and stacking in L2.

tbench 8
6.1.114-cfs                    3674.37 MB/sec
6.1.114-eevdf -delay_dequeue   3505.25 MB/sec
6.1.114-eevdf +delay_dequeue   3701.66 MB/sec

For tbench, preemption = shorter turnaround = higher throughput.
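
To put the tbench 8 numbers in relative terms, the sketch below computes percentage differences from the three throughput figures above. Nothing here is measured anew; the variable and function names are only for illustration.

```python
# tbench 8 throughput in MB/sec, copied from the results above.
cfs = 3674.37
eevdf_no_delay = 3505.25
eevdf_delay = 3701.66

def pct_vs(baseline, value):
    """Relative difference of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0

no_delay_vs_cfs = pct_vs(cfs, eevdf_no_delay)
delay_vs_cfs = pct_vs(cfs, eevdf_delay)
delay_vs_no_delay = pct_vs(eevdf_no_delay, eevdf_delay)

print(f"eevdf -delay_dequeue vs cfs:   {no_delay_vs_cfs:+.1f}%")
print(f"eevdf +delay_dequeue vs cfs:   {delay_vs_cfs:+.1f}%")
print(f"+delay_dequeue vs -delay_dequeue: {delay_vs_no_delay:+.1f}%")
```

So on this box eevdf without delay_dequeue trails cfs by a few percent, and turning delay_dequeue on recovers that and lands slightly ahead.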

	-Mike