Message-ID: <20230501082536.GA1597476@hirez.programming.kicks-ass.net>
Date: Mon, 1 May 2023 10:25:36 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Mike Galbraith <efault@....de>
Cc: Chen Yu <yu.c.chen@...el.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Tim Chen <tim.c.chen@...el.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Abel Wu <wuyun.abel@...edance.com>,
Yicong Yang <yangyicong@...ilicon.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Honglei Wang <wanghonglei@...ichuxing.com>,
Len Brown <len.brown@...el.com>,
Chen Yu <yu.chen.surf@...il.com>,
Tianchen Ding <dtcccc@...ux.alibaba.com>,
Joel Fernandes <joel@...lfernandes.org>,
Josh Don <joshdon@...gle.com>,
kernel test robot <yujie.liu@...el.com>,
Arjan Van De Ven <arjan.van.de.ven@...el.com>,
Aaron Lu <aaron.lu@...el.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up
short task on current CPU
On Sat, Apr 29, 2023 at 09:34:06PM +0200, Mike Galbraith wrote:
> On Sat, 2023-04-29 at 07:16 +0800, Chen Yu wrote:
> > [Problem Statement]
> > For a workload that does frequent context switches, the throughput
> > scales well until the number of instances reaches a peak point. Beyond
> > that point, the throughput drops significantly as the number of
> > instances continues to increase.
> >
> > The will-it-scale context_switch1 test case exposes the issue. The
> > test platform has 2 sockets of 56C/112T each, 224 CPUs in total.
> > will-it-scale launches 1, 8, 16 ... instances in turn. Each instance
> > is composed of 2 tasks, and each pair of tasks does ping-pong
> > scheduling via pipe_read() and pipe_write(). No task is bound to any
> > CPU. It is found that, once the number of instances exceeds 56, the
> > throughput drops accordingly:
> >
> >             ^
> >  throughput |
> >             |                   X
> >             |                X     X
> >             |             X           X
> >             |          X                 X
> >             |       X                       X
> >             |    X                             X
> >             |  X
> >             | X
> >             |
> >             +-------------------.-------------------->
> >                                 56
> >                                       number of instances
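
(For reference, each such instance boils down to a pipe ping-pong pair
along the lines of the sketch below; this is an illustrative stand-in,
not the actual will-it-scale context_switch1 source, and the iteration
count is arbitrary.)

/* Minimal pipe ping-pong pair: parent and child alternate blocking
 * reads and writes on two pipes, forcing a context switch per round
 * trip, with neither task bound to a CPU. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int p2c[2], c2p[2];	/* parent->child and child->parent pipes */
	char buf = 'x';

	if (pipe(p2c) || pipe(c2p)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* child: echo each byte straight back */
		while (read(p2c[0], &buf, 1) == 1 &&
		       write(c2p[1], &buf, 1) == 1)
			;
		_exit(0);
	}

	/* parent: each iteration sleeps in pipe_read() until the child
	 * answers, i.e. one waker/wakee ping-pong per loop */
	for (long i = 0; i < 1000000; i++) {
		if (write(p2c[1], &buf, 1) != 1 ||
		    read(c2p[0], &buf, 1) != 1)
			break;
	}
	return 0;
}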
>
> Should these buddy pairs not start interfering with one another at 112
> instances instead of 56? NR_CPUS/2 buddy pair instances is the point at
> which trying to turn waker/wakee overlap into throughput should tend
> toward being a loser, because man-in-the-middle wakeup delay pain then
> more than offsets the overlap recovery gain, rendering sync wakeup
> thereafter an ever more likely win.
>
> Anyway..
>
> What I see in my box, and I bet a virtual nickel it's a player in your
> box as well, is WA_WEIGHT making a mess of things by stacking tasks,
> sometimes very badly. Below, I start NR_CPUS tbench buddy pairs in my
> crusty ole i4790 desktop box with WA_WEIGHT turned off, then turn it on
> remotely so as not to have the noisy GUI muck up my demo.
>
> ...
> 8 3155749 3606.79 MB/sec warmup 38 sec latency 3.852 ms
> 8 3238485 3608.75 MB/sec warmup 39 sec latency 3.839 ms
> 8 3321578 3608.59 MB/sec warmup 40 sec latency 3.882 ms
> 8 3404746 3608.09 MB/sec warmup 41 sec latency 2.273 ms
> 8 3487885 3607.58 MB/sec warmup 42 sec latency 3.869 ms
> 8 3571034 3607.12 MB/sec warmup 43 sec latency 3.855 ms
> 8 3654067 3607.48 MB/sec warmup 44 sec latency 3.857 ms
> 8 3736973 3608.83 MB/sec warmup 45 sec latency 4.008 ms
> 8 3820160 3608.33 MB/sec warmup 46 sec latency 3.849 ms
> 8 3902963 3607.60 MB/sec warmup 47 sec latency 14.241 ms
> 8 3986117 3607.17 MB/sec warmup 48 sec latency 20.290 ms
> 8 4069256 3606.70 MB/sec warmup 49 sec latency 28.284 ms
> 8 4151986 3608.35 MB/sec warmup 50 sec latency 17.216 ms
> 8 4235070 3608.06 MB/sec warmup 51 sec latency 23.221 ms
> 8 4318221 3607.81 MB/sec warmup 52 sec latency 28.285 ms
> 8 4401456 3607.29 MB/sec warmup 53 sec latency 20.835 ms
> 8 4484606 3607.06 MB/sec warmup 54 sec latency 28.943 ms
> 8 4567609 3607.32 MB/sec warmup 55 sec latency 28.254 ms
>
> Where I turned it on is hard to miss.
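
(Aside: for anyone reproducing the toggle, WA_WEIGHT is a scheduler
feature bit that can be flipped at runtime through debugfs on a
SCHED_DEBUG kernel; the exact path depends on kernel version, e.g.:

  echo NO_WA_WEIGHT > /sys/kernel/debug/sched/features   # disable
  echo WA_WEIGHT    > /sys/kernel/debug/sched/features   # re-enable
  # older kernels expose the same knob as /sys/kernel/debug/sched_features
)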
>
> Short duration thread pool workers can be stacked all the way to the
> ceiling by WA_WEIGHT during burst wakeups, with wake_wide() unable to
> intervene due to the lack of cross coupling between waker and wakees,
> leading to heuristic failure. A (now long) while ago I caught that
> happening with firefox event threads: it launched 32 of 'em in my 8 rq
> box (hmm), and them being essentially the scheduler equivalent of
> neutrinos (nearly massless), we stuffed 'em all into one rq.. and got
> away with it because those particular threads don't seem to do much of
> anything. However, were they to go active, the latency hit that we set
> up could have stung mightily. That scenario being highly generic leads
> me to suspect that somewhere out there in the big wide world, folks are
> eating that burst serialization.
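
(For context on that heuristic failure: wake_wide() keys off per-task
wakee_flips counters, which only grow when a task keeps switching
between different wakees, and which decay over time. Roughly
paraphrased from kernel/sched/fair.c of this era, details elided:)

/* record_wakee() bumps current->wakee_flips whenever the waker wakes a
 * different task than last time.  One-shot pool workers never wake the
 * master back, so their flip count stays near zero, "slave < factor"
 * holds, wake_wide() returns 0 and the affine (stacking) path is taken. */
static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = __this_cpu_read(sd_llc_size);	/* CPUs sharing the LLC */

	if (master < slave)
		swap(master, slave);
	if (slave < factor || master < slave * factor)
		return 0;	/* wake affine */
	return 1;		/* wake wide */
}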
I'm thinking WA_BIAS makes this worse...
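
(For reference, WA_BIAS tilts the load comparison in
wake_affine_weight() toward the waking CPU by roughly half of the
domain's imbalance_pct, so affine pulls win more often. A condensed
paraphrase of the fair.c logic, with the sync-wakeup handling left out,
not the exact upstream code:)

static int wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
			      int this_cpu, int prev_cpu)
{
	unsigned long task_load = task_h_load(p);
	s64 this_eff_load, prev_eff_load;

	/* load the waking CPU would carry if it also took the wakee */
	this_eff_load = cpu_load(cpu_rq(this_cpu)) + task_load;
	if (sched_feat(WA_BIAS))
		this_eff_load *= 100;
	this_eff_load *= capacity_of(prev_cpu);

	/* load prev_cpu keeps after losing the wakee, inflated by about
	 * half of imbalance_pct when WA_BIAS is set */
	prev_eff_load = cpu_load(cpu_rq(prev_cpu)) - task_load;
	if (sched_feat(WA_BIAS))
		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
	prev_eff_load *= capacity_of(this_cpu);

	/* smaller effective load wins; a tie or loss means "no pull" */
	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}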