[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191025133325.GA2421@pauld.bos.csb>
Date: Fri, 25 Oct 2019 09:33:26 -0400
From: Phil Auld <pauld@...hat.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Ingo Molnar <mingo@...nel.org>,
Mel Gorman <mgorman@...hsingularity.net>,
linux-kernel <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Valentin Schneider <valentin.schneider@....com>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
Quentin Perret <quentin.perret@....com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Morten Rasmussen <Morten.Rasmussen@....com>,
Hillf Danton <hdanton@...a.com>,
Parth Shah <parth@...ux.ibm.com>,
Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH v4 00/10] sched/fair: rework the CFS load balance
Hi Vincent,
On Thu, Oct 24, 2019 at 04:59:05PM +0200 Vincent Guittot wrote:
> On Thu, 24 Oct 2019 at 15:47, Phil Auld <pauld@...hat.com> wrote:
> >
> > On Thu, Oct 24, 2019 at 08:38:44AM -0400 Phil Auld wrote:
> > > Hi Vincent,
> > >
> > > On Mon, Oct 21, 2019 at 10:44:20AM +0200 Vincent Guittot wrote:
> > > > On Mon, 21 Oct 2019 at 09:50, Ingo Molnar <mingo@...nel.org> wrote:
> > > > >
>
> [...]
>
> > > > > A full run on Mel Gorman's magic scalability test-suite would be super
> > > > > useful ...
> > > > >
> > > > > Anyway, please be on the lookout for such performance regression reports.
> > > >
> > > > Yes I monitor the regressions on the mailing list
> > >
> > >
> > > Our kernel perf tests show good results across the board for v4.
> > >
> > > The issue we hit on the 8-node system is fixed. Thanks!
> > >
> > > As we didn't see the fairness issue I don't expect the results to be
> > > that different on v4a (with the followup patch) but those tests are
> > > queued up now and we'll see what they look like.
> > >
> >
> > Initial results with fix patch (v4a) show that the outlier issues on
> > the 8-node system have returned. Median time for 152 and 156 threads
> > (160 cpu system) goes up significantly and worst case goes from 340
> > and 250 to 550 sec. for both. And doubles from 150 to 300 for 144
>
> For v3, you had a x4 slow down IIRC.
>
Sorry, that was a confusing change of data point :)
That 4x was the normal versus group result for v3. I.e. the usual
view of this test case's data.
These numbers above are the group vs group difference between
v4 and v4a.
The similar data points are that for v4 there was no difference
in performance between group and normal at 152 threads and a 35%
drop off from normal to group at 156.
With v4a there was 100% drop (2x slowdown) normal to group at 152
and close to that at 156 (~75-80% drop off).
So, yes, not as severe as v3. But significantly off from v4.
>
> > threads. These look more like the results from v3.
>
> OK. For v3, we were not sure that your UC triggers the slow path but
> it seems that we have the confirmation now.
> The problem happens only for this 8 node 160 cores system, isn't it ?
Yes. It only shows up now on this 8-node system.
>
> The fix favors the local group so your UC seems to prefer spreading
> tasks at wake up
> If you have any traces that you can share, this could help to
> understand what's going on. I will try to reproduce the problem on my
> system
I'm not actually sure the fix here is causing this. Looking at the data
more closely I see similar imbalances on v4, v4a and v3.
When you say slow versus fast wakeup paths what do you mean? I'm still
learning my way around all this code.
This particular test is specifically designed to highlight the imbalance
cause by the use of group scheduler defined load and averages. The threads
are mostly CPU bound but will join up every time step. So if each thread
more or less gets its own CPU (we run with fewer threads than CPUs) they
all finish the timestep at about the same time. If threads are stuck
sharing cpus then those finish later and the whole computation is slowed
down. In addition to the NAS benchmark threads there are 2 stress CPU
burners. These are either run in their own cgroups (thus having full "load")
or all in the same cgroup with the benchmarck, thus all having tiny "loads".
In this system, there are 20 cpus per node. We track average number of
benchmark threads running in each node. Generally for a balanced case
we should not have any much over 20 and indeed in the normal case (every
one in one cgroup) we see pretty nice balance. In the cgroup case we are
still seeing numbers much higher than 20.
Here are some eye charts:
This is the GROUP numbers from that machine on the v1 series (I don't have the
NORMAL lines handy for this one):
lu.C.x_152_GROUP_1 Average 18.08 18.17 19.58 19.29 19.25 17.50 21.46 18.67
lu.C.x_152_GROUP_2 Average 17.12 17.48 17.88 17.62 19.57 17.31 23.00 22.02
lu.C.x_152_GROUP_3 Average 17.82 17.97 18.12 18.18 24.55 22.18 16.97 16.21
lu.C.x_152_GROUP_4 Average 18.47 19.08 18.50 18.66 21.45 25.00 15.47 15.37
lu.C.x_152_GROUP_5 Average 20.46 20.71 27.38 24.75 17.06 16.65 12.81 12.19
lu.C.x_156_GROUP_1 Average 18.70 18.80 20.25 19.50 20.45 20.30 19.55 18.45
lu.C.x_156_GROUP_2 Average 19.29 19.90 17.71 18.10 20.76 21.57 19.81 18.86
lu.C.x_156_GROUP_3 Average 25.09 29.19 21.83 21.33 18.67 18.57 11.03 10.29
lu.C.x_156_GROUP_4 Average 18.60 19.10 19.20 18.70 20.30 20.00 19.70 20.40
lu.C.x_156_GROUP_5 Average 18.58 18.95 18.63 18.1 17.32 19.37 23.92 21.08
There are a couple that did not balance well but the overall results were good.
This is v4:
lu.C.x_152_GROUP_1 Average 18.80 19.25 21.95 21.25 17.55 17.25 17.85 18.10
lu.C.x_152_GROUP_2 Average 20.57 20.62 19.76 17.76 18.95 18.33 18.52 17.48
lu.C.x_152_GROUP_3 Average 15.39 12.22 13.96 12.19 25.51 28.91 21.88 21.94
lu.C.x_152_GROUP_4 Average 20.30 19.75 20.75 19.45 18.15 17.80 18.15 17.65
lu.C.x_152_GROUP_5 Average 15.13 12.21 13.63 11.39 25.42 30.21 21.55 22.46
lu.C.x_152_NORMAL_1 Average 17.00 16.88 19.52 18.28 19.24 19.08 21.08 20.92
lu.C.x_152_NORMAL_2 Average 18.61 16.56 18.56 17.00 20.56 20.28 20.00 20.44
lu.C.x_152_NORMAL_3 Average 19.27 19.77 21.23 20.86 18.00 17.68 17.73 17.45
lu.C.x_152_NORMAL_4 Average 20.24 19.33 21.33 21.10 17.33 18.43 17.57 16.67
lu.C.x_152_NORMAL_5 Average 21.27 20.36 20.86 19.36 17.50 17.77 17.32 17.55
lu.C.x_156_GROUP_1 Average 18.60 18.68 21.16 23.40 18.96 19.72 17.76 17.72
lu.C.x_156_GROUP_2 Average 22.76 21.71 20.55 21.32 18.18 16.42 17.58 17.47
lu.C.x_156_GROUP_3 Average 13.62 11.52 15.54 15.58 25.42 28.54 23.22 22.56
lu.C.x_156_GROUP_4 Average 17.73 18.14 21.95 21.82 19.73 19.68 18.55 18.41
lu.C.x_156_GROUP_5 Average 15.32 15.14 17.30 17.11 23.59 25.75 20.77 21.02
lu.C.x_156_NORMAL_1 Average 19.06 18.72 19.56 18.72 19.72 21.28 19.44 19.50
lu.C.x_156_NORMAL_2 Average 20.25 19.86 22.61 23.18 18.32 17.93 16.39 17.46
lu.C.x_156_NORMAL_3 Average 18.84 17.88 19.24 17.76 21.04 20.64 20.16 20.44
lu.C.x_156_NORMAL_4 Average 20.67 19.44 20.74 22.15 18.89 18.85 18.00 17.26
lu.C.x_156_NORMAL_5 Average 20.12 19.65 24.12 24.15 17.40 16.62 17.10 16.83
This one is better overall, but there are some mid 20s abd 152_GROUP_5 is pretty bad.
This is v4a
lu.C.x_152_GROUP_1 Average 28.64 34.49 23.60 24.48 10.35 11.99 8.36 10.09
lu.C.x_152_GROUP_2 Average 17.36 17.33 15.48 13.12 24.90 24.43 18.55 20.83
lu.C.x_152_GROUP_3 Average 20.00 19.92 20.21 21.33 18.50 18.50 16.50 17.04
lu.C.x_152_GROUP_4 Average 18.07 17.87 18.40 17.87 23.07 22.73 17.60 16.40
lu.C.x_152_GROUP_5 Average 25.50 24.69 21.48 21.46 16.85 16.00 14.06 11.96
lu.C.x_152_NORMAL_1 Average 22.27 20.77 20.60 19.83 16.73 17.53 15.83 18.43
lu.C.x_152_NORMAL_2 Average 19.83 20.81 23.06 21.97 17.28 16.92 15.83 16.31
lu.C.x_152_NORMAL_3 Average 17.85 19.31 18.85 19.08 19.00 19.31 19.08 19.54
lu.C.x_152_NORMAL_4 Average 18.87 18.13 19.00 20.27 18.20 18.67 19.73 19.13
lu.C.x_152_NORMAL_5 Average 18.16 18.63 18.11 17.00 19.79 20.63 19.47 20.21
lu.C.x_156_GROUP_1 Average 24.96 26.15 21.78 21.48 18.52 19.11 12.98 11.02
lu.C.x_156_GROUP_2 Average 18.69 19.00 18.65 18.42 20.50 20.46 19.85 20.42
lu.C.x_156_GROUP_3 Average 24.32 23.79 20.82 20.95 16.63 16.61 18.47 14.42
lu.C.x_156_GROUP_4 Average 18.27 18.34 14.88 16.07 27.00 21.93 20.56 18.95
lu.C.x_156_GROUP_5 Average 19.18 20.99 33.43 29.57 15.63 15.54 12.13 9.53
lu.C.x_156_NORMAL_1 Average 21.60 23.37 20.11 19.60 17.11 17.83 18.17 18.20
lu.C.x_156_NORMAL_2 Average 21.00 20.54 19.88 18.79 17.62 18.67 19.29 20.21
lu.C.x_156_NORMAL_3 Average 19.50 19.94 20.12 18.62 19.88 19.50 19.00 19.44
lu.C.x_156_NORMAL_4 Average 20.62 19.72 20.03 22.17 18.21 18.55 18.45 18.24
lu.C.x_156_NORMAL_5 Average 19.64 19.86 21.46 22.43 17.21 17.89 18.96 18.54
This shows much more imblance in the GROUP case. There are some single digits
and some 30s.
For comparison here are some from my 4-node (80 cpu) system:
v4
lu.C.x_76_GROUP_1.ps.numa.hist Average 19.58 17.67 18.25 20.50
lu.C.x_76_GROUP_2.ps.numa.hist Average 19.08 19.17 17.67 20.08
lu.C.x_76_GROUP_3.ps.numa.hist Average 19.42 18.58 18.42 19.58
lu.C.x_76_NORMAL_1.ps.numa.hist Average 20.50 17.33 19.08 19.08
lu.C.x_76_NORMAL_2.ps.numa.hist Average 19.45 18.73 19.27 18.55
v4a
lu.C.x_76_GROUP_1.ps.numa.hist Average 19.46 19.15 18.62 18.77
lu.C.x_76_GROUP_2.ps.numa.hist Average 19.00 18.58 17.75 20.67
lu.C.x_76_GROUP_3.ps.numa.hist Average 19.08 17.08 20.08 19.77
lu.C.x_76_NORMAL_1.ps.numa.hist Average 18.67 18.93 18.60 19.80
lu.C.x_76_NORMAL_2.ps.numa.hist Average 19.08 18.67 18.58 19.67
Nicely balanced in both kernels and normal and group are basically the
same.
There's still something between v1 and v4 on that 8-node system that is
still illustrating the original problem. On our other test systems this
series really works nicely to solve this problem. And even if we can't get
to the bottom if this it's a significant improvement.
Here is v3 for the 8-node system
lu.C.x_152_GROUP_1 Average 17.52 16.86 17.90 18.52 20.00 19.00 22.00 20.19
lu.C.x_152_GROUP_2 Average 15.70 15.04 15.65 15.72 23.30 28.98 20.09 17.52
lu.C.x_152_GROUP_3 Average 27.72 32.79 22.89 22.62 11.01 12.90 12.14 9.93
lu.C.x_152_GROUP_4 Average 18.13 18.87 18.40 17.87 18.80 19.93 20.40 19.60
lu.C.x_152_GROUP_5 Average 24.14 26.46 20.92 21.43 14.70 16.05 15.14 13.16
lu.C.x_152_NORMAL_1 Average 21.03 22.43 20.27 19.97 18.37 18.80 16.27 14.87
lu.C.x_152_NORMAL_2 Average 19.24 18.29 18.41 17.41 19.71 19.00 20.29 19.65
lu.C.x_152_NORMAL_3 Average 19.43 20.00 19.05 20.24 18.76 17.38 18.52 18.62
lu.C.x_152_NORMAL_4 Average 17.19 18.25 17.81 18.69 20.44 19.75 20.12 19.75
lu.C.x_152_NORMAL_5 Average 19.25 19.56 19.12 19.56 19.38 19.38 18.12 17.62
lu.C.x_156_GROUP_1 Average 18.62 19.31 18.38 18.77 19.88 21.35 19.35 20.35
lu.C.x_156_GROUP_2 Average 15.58 12.72 14.96 14.83 20.59 19.35 29.75 28.22
lu.C.x_156_GROUP_3 Average 20.05 18.74 19.63 18.32 20.26 20.89 19.53 18.58
lu.C.x_156_GROUP_4 Average 14.77 11.42 13.01 10.09 27.05 33.52 23.16 22.98
lu.C.x_156_GROUP_5 Average 14.94 11.45 12.77 10.52 28.01 33.88 22.37 22.05
lu.C.x_156_NORMAL_1 Average 20.00 20.58 18.47 18.68 19.47 19.74 19.42 19.63
lu.C.x_156_NORMAL_2 Average 18.52 18.48 18.83 18.43 20.57 20.48 20.61 20.09
lu.C.x_156_NORMAL_3 Average 20.27 20.00 20.05 21.18 19.55 19.00 18.59 17.36
lu.C.x_156_NORMAL_4 Average 19.65 19.60 20.25 20.75 19.35 20.10 19.00 17.30
lu.C.x_156_NORMAL_5 Average 19.79 19.67 20.62 22.42 18.42 18.00 17.67 19.42
I'll try to find pre-patched results for this 8 node system. Just to keep things
together for reference here is the 4-node system before this re-work series.
lu.C.x_76_GROUP_1 Average 15.84 24.06 23.37 12.73
lu.C.x_76_GROUP_2 Average 15.29 22.78 22.49 15.45
lu.C.x_76_GROUP_3 Average 13.45 23.90 22.97 15.68
lu.C.x_76_NORMAL_1 Average 18.31 19.54 19.54 18.62
lu.C.x_76_NORMAL_2 Average 19.73 19.18 19.45 17.64
This produced a 4.5x slowdown for the group runs versus the nicely balance
normal runs.
I can try to get traces but this is not my system so it may take a little
while. I've found that the existing trace points don't give enough information
to see what is happening in this problem. But the visualization in kernelshark
does show the problem pretty well. Do you want just the existing sched tracepoints
or should I update some of the traceprintks I used in the earlier traces?
Cheers,
Phil
>
> >
> > We're re-running the test to get more samples.
>
> Thanks
> Vincent
>
> >
> >
> > Other tests and systems were still fine.
> >
> >
> > Cheers,
> > Phil
> >
> >
> > > Numbers for my specific testcase (the cgroup imbalance) are basically
> > > the same as I posted for v3 (plus the better 8-node numbers). I.e. this
> > > series solves that issue.
> > >
> > >
> > > Cheers,
> > > Phil
> > >
> > >
> > > >
> > > > >
> > > > > Also, we seem to have grown a fair amount of these TODO entries:
> > > > >
> > > > > kernel/sched/fair.c: * XXX borrowed from update_sg_lb_stats
> > > > > kernel/sched/fair.c: * XXX: only do this for the part of runnable > running ?
> > > > > kernel/sched/fair.c: * XXX illustrate
> > > > > kernel/sched/fair.c: } else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
> > > > > kernel/sched/fair.c: * can also include other factors [XXX].
> > > > > kernel/sched/fair.c: * [XXX expand on:
> > > > > kernel/sched/fair.c: * [XXX more?]
> > > > > kernel/sched/fair.c: * [XXX write more on how we solve this.. _after_ merging pjt's patches that
> > > > > kernel/sched/fair.c: * XXX for now avg_load is not computed and always 0 so we
> > > > > kernel/sched/fair.c: /* XXX broken for overlapping NUMA groups */
> > > > >
> > > >
> > > > I will have a look :-)
> > > >
> > > > > :-)
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Ingo
> > >
> > > --
> > >
> >
> > --
> >
--
Powered by blists - more mailing lists