Message-ID: <20230501082536.GA1597476@hirez.programming.kicks-ass.net>
Date: Mon, 1 May 2023 10:25:36 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Mike Galbraith <efault@....de>
Cc: Chen Yu <yu.c.chen@...el.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Tim Chen <tim.c.chen@...el.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Abel Wu <wuyun.abel@...edance.com>,
Yicong Yang <yangyicong@...ilicon.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Honglei Wang <wanghonglei@...ichuxing.com>,
Len Brown <len.brown@...el.com>,
Chen Yu <yu.chen.surf@...il.com>,
Tianchen Ding <dtcccc@...ux.alibaba.com>,
Joel Fernandes <joel@...lfernandes.org>,
Josh Don <joshdon@...gle.com>,
kernel test robot <yujie.liu@...el.com>,
Arjan Van De Ven <arjan.van.de.ven@...el.com>,
Aaron Lu <aaron.lu@...el.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up
short task on current CPU
On Sat, Apr 29, 2023 at 09:34:06PM +0200, Mike Galbraith wrote:
> On Sat, 2023-04-29 at 07:16 +0800, Chen Yu wrote:
> > [Problem Statement]
> > For a workload that does frequent context switches, the throughput
> > scales well until the number of instances reaches a peak point. Beyond
> > that point, the throughput drops significantly as the number of
> > instances continues to increase.
> >
> > The will-it-scale context_switch1 test case exposes the issue. The
> > test platform has 2 sockets of 56C/112T each, 224 CPUs in total.
> > will-it-scale launches 1, 8, 16 ... instances in turn. Each instance
> > is composed of 2 tasks, and each pair of tasks does ping-pong
> > scheduling via pipe_read() and pipe_write(). No task is bound to any
> > CPU. It is found that, once the number of instances exceeds 56, the
> > throughput drops accordingly:
> >
> >             ^
> >  throughput |
> >             |                   X
> >             |                X     X
> >             |             X           X
> >             |          X                 X
> >             |       X                       X
> >             |    X                             X
> >             |  X
> >             | X
> >             |
> >             +-------------------.-------------------->
> >                                 56
> >                                       number of instances
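
(For reference, each such instance boils down to a pipe ping-pong pair
along the lines of the sketch below; this is an illustrative stand-in,
not the actual will-it-scale context_switch1 source, and the iteration
count is arbitrary.)

/* Minimal pipe ping-pong pair: parent and child alternate blocking
 * reads and writes on two pipes, forcing a context switch per round
 * trip, with neither task bound to a CPU. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int p2c[2], c2p[2];	/* parent->child and child->parent pipes */
	char buf = 'x';

	if (pipe(p2c) || pipe(c2p)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* child: echo each byte straight back */
		while (read(p2c[0], &buf, 1) == 1 &&
		       write(c2p[1], &buf, 1) == 1)
			;
		_exit(0);
	}

	/* parent: each iteration sleeps in pipe_read() until the child
	 * answers, i.e. one waker/wakee ping-pong per loop */
	for (long i = 0; i < 1000000; i++) {
		if (write(p2c[1], &buf, 1) != 1 ||
		    read(c2p[0], &buf, 1) != 1)
			break;
	}
	return 0;
}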
>
> Should these buddy pairs not start interfering with one another at 112
> instances instead of 56? NR_CPUS/2 buddy pair instances is the point at
> which trying to turn waker/wakee overlap into throughput should tend
> toward being a loser, because man-in-the-middle wakeup delay pain then
> more than offsets the overlap recovery gain, rendering sync wakeup
> thereafter an ever more likely win.
>
> Anyway..
>
> What I see in my box, and I bet a virtual nickel it's a player in your
> box as well, is WA_WEIGHT making a mess of things by stacking tasks,
> sometimes very badly. Below, I start NR_CPUS tbench buddy pairs in my
> crusty ole i4790 desktop box with WA_WEIGHT turned off, then turn it on
> remotely so as not to have the noisy GUI muck up my demo.
>
> ...
> 8 3155749 3606.79 MB/sec warmup 38 sec latency 3.852 ms
> 8 3238485 3608.75 MB/sec warmup 39 sec latency 3.839 ms
> 8 3321578 3608.59 MB/sec warmup 40 sec latency 3.882 ms
> 8 3404746 3608.09 MB/sec warmup 41 sec latency 2.273 ms
> 8 3487885 3607.58 MB/sec warmup 42 sec latency 3.869 ms
> 8 3571034 3607.12 MB/sec warmup 43 sec latency 3.855 ms
> 8 3654067 3607.48 MB/sec warmup 44 sec latency 3.857 ms
> 8 3736973 3608.83 MB/sec warmup 45 sec latency 4.008 ms
> 8 3820160 3608.33 MB/sec warmup 46 sec latency 3.849 ms
> 8 3902963 3607.60 MB/sec warmup 47 sec latency 14.241 ms
> 8 3986117 3607.17 MB/sec warmup 48 sec latency 20.290 ms
> 8 4069256 3606.70 MB/sec warmup 49 sec latency 28.284 ms
> 8 4151986 3608.35 MB/sec warmup 50 sec latency 17.216 ms
> 8 4235070 3608.06 MB/sec warmup 51 sec latency 23.221 ms
> 8 4318221 3607.81 MB/sec warmup 52 sec latency 28.285 ms
> 8 4401456 3607.29 MB/sec warmup 53 sec latency 20.835 ms
> 8 4484606 3607.06 MB/sec warmup 54 sec latency 28.943 ms
> 8 4567609 3607.32 MB/sec warmup 55 sec latency 28.254 ms
>
> Where I turned it on is hard to miss.
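
(Aside: for anyone reproducing the toggle, WA_WEIGHT is a scheduler
feature bit that can be flipped at runtime through debugfs on a
SCHED_DEBUG kernel; the exact path depends on kernel version, e.g.:

  echo NO_WA_WEIGHT > /sys/kernel/debug/sched/features   # disable
  echo WA_WEIGHT    > /sys/kernel/debug/sched/features   # re-enable
  # older kernels expose the same knob as /sys/kernel/debug/sched_features
)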
>
> Short duration thread pool workers can be stacked all the way to the
> ceiling by WA_WEIGHT during burst wakeups, with wake_wide() unable to
> intervene due to the lack of cross coupling between waker and wakees,
> leading to heuristic failure. A (now long) while ago I caught that
> happening with firefox event threads: it launched 32 of 'em in my 8 rq
> box (hmm), and them being essentially the scheduler equivalent of
> neutrinos (nearly massless), we stuffed 'em all into one rq.. and got
> away with it because those particular threads don't seem to do much of
> anything. However, were they to go active, the latency hit that we set
> up could have stung mightily. That scenario being highly generic leads
> me to suspect that somewhere out there in the big wide world, folks are
> eating that burst serialization.
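
(For context on that heuristic failure: wake_wide() keys off per-task
wakee_flips counters, which only grow when a task keeps switching
between different wakees, and which decay over time. Roughly
paraphrased from kernel/sched/fair.c of this era, details elided:)

/* record_wakee() bumps current->wakee_flips whenever the waker wakes a
 * different task than last time.  One-shot pool workers never wake the
 * master back, so their flip count stays near zero, "slave < factor"
 * holds, wake_wide() returns 0 and the affine (stacking) path is taken. */
static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = __this_cpu_read(sd_llc_size);	/* CPUs sharing the LLC */

	if (master < slave)
		swap(master, slave);
	if (slave < factor || master < slave * factor)
		return 0;	/* wake affine */
	return 1;		/* wake wide */
}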
I'm thinking WA_BIAS makes this worse...
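
(For reference, WA_BIAS tilts the load comparison in
wake_affine_weight() toward the waking CPU by roughly half of the
domain's imbalance_pct, so affine pulls win more often. A condensed
paraphrase of the fair.c logic, with the sync-wakeup handling left out,
not the exact upstream code:)

static int wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
			      int this_cpu, int prev_cpu)
{
	unsigned long task_load = task_h_load(p);
	s64 this_eff_load, prev_eff_load;

	/* load the waking CPU would carry if it also took the wakee */
	this_eff_load = cpu_load(cpu_rq(this_cpu)) + task_load;
	if (sched_feat(WA_BIAS))
		this_eff_load *= 100;
	this_eff_load *= capacity_of(prev_cpu);

	/* load prev_cpu keeps after losing the wakee, inflated by about
	 * half of imbalance_pct when WA_BIAS is set */
	prev_eff_load = cpu_load(cpu_rq(prev_cpu)) - task_load;
	if (sched_feat(WA_BIAS))
		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
	prev_eff_load *= capacity_of(this_cpu);

	/* smaller effective load wins; a tie or loss means "no pull" */
	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}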