Message-ID: <CAKfTPtCm5dM4G0D6jSenNQSbpnKkUpOev24Ms02aszPyxyfuuQ@mail.gmail.com>
Date:   Thu, 31 Oct 2019 17:41:39 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Phil Auld <pauld@...hat.com>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        Quentin Perret <quentin.perret@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Hillf Danton <hdanton@...a.com>,
        Parth Shah <parth@...ux.ibm.com>,
        Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH v4 00/10] sched/fair: rework the CFS load balance

On Thu, 31 Oct 2019 at 14:57, Phil Auld <pauld@...hat.com> wrote:
>
> Hi Vincent,
>
> On Wed, Oct 30, 2019 at 06:25:49PM +0100 Vincent Guittot wrote:
> > On Wed, 30 Oct 2019 at 15:39, Phil Auld <pauld@...hat.com> wrote:
> > > > The fact that the 4-node system works well but the 8-node one does not
> > > > is a bit surprising, unless it means more NUMA levels in the
> > > > sched_domain topology.
> > > > Could you give us more details about the sched domain topology?
> > > >
> > >
> > > The 8-node system has 5 sched domain levels.  The 4-node system only
> > > has 3.
> >
> > That's an interesting difference, and your additional tests on an 8-node
> > system with 3 levels tend to confirm that the number of levels makes a
> > difference.
> > I need to study a bit more how this can impact the spread of tasks.
>
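
As an aside, one quick way to see how many sched_domain levels a machine has is
to read the per-CPU domain names. A minimal sketch, assuming CONFIG_SCHED_DEBUG
and the v5.4 /proc layout (newer kernels expose this under
/sys/kernel/debug/sched/domains/ instead):

    # one line per sched_domain level for cpu0 (e.g. SMT, MC, NUMA, ...)
    for d in /proc/sys/kernel/sched_domain/cpu0/domain*/name; do
            printf '%s: %s\n' "$d" "$(cat "$d")"
    done
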
> So I think I understand what my numbers have been showing.
>
> I believe the numa balancing is causing problems.
>
> Here are numbers from the test on 5.4-rc3+ without your series:
>
> echo 1 >  /proc/sys/kernel/numa_balancing
> lu.C.x_156_GROUP_1  Average    10.87  0.00   0.00   11.49  36.69  34.26  30.59  32.10
> lu.C.x_156_GROUP_2  Average    20.15  16.32  9.49   24.91  21.07  20.93  21.63  21.50
> lu.C.x_156_GROUP_3  Average    21.27  17.23  11.84  21.80  20.91  20.68  21.11  21.16
> lu.C.x_156_GROUP_4  Average    19.44  6.53   8.71   19.72  22.95  23.16  28.85  26.64
> lu.C.x_156_GROUP_5  Average    20.59  6.20   11.32  14.63  28.73  30.36  22.20  21.98
> lu.C.x_156_NORMAL_1 Average    20.50  19.95  20.40  20.45  18.75  19.35  18.25  18.35
> lu.C.x_156_NORMAL_2 Average    17.15  19.04  18.42  18.69  21.35  21.42  20.00  19.92
> lu.C.x_156_NORMAL_3 Average    18.00  18.15  17.55  17.60  18.90  18.40  19.90  19.75
> lu.C.x_156_NORMAL_4 Average    20.53  20.05  20.21  19.11  19.00  19.47  19.37  18.26
> lu.C.x_156_NORMAL_5 Average    18.72  18.78  19.72  18.50  19.67  19.72  21.11  19.78
>
> ============156_GROUP========Mop/s===================================
> min     q1      median  q3      max
> 1564.63 3003.87 3928.23 5411.13 8386.66
> ============156_GROUP========time====================================
> min     q1      median  q3      max
> 243.12  376.82  519.06  678.79  1303.18
> ============156_NORMAL========Mop/s===================================
> min     q1      median  q3      max
> 13845.6 18013.8 18545.5 19359.9 19647.4
> ============156_NORMAL========time====================================
> min     q1      median  q3      max
> 103.78  105.32  109.95  113.19  147.27
>
> (This run above is especially bad; we don't usually see 0.00s, but overall it's
> basically on par. The variability is reflected in the spread of the results.)
>
>
> echo 0 >  /proc/sys/kernel/numa_balancing
> lu.C.x_156_GROUP_1  Average    17.75  19.30  21.20  21.20  20.20  20.80  18.90  16.65
> lu.C.x_156_GROUP_2  Average    18.38  19.25  21.00  20.06  20.19  20.31  19.56  17.25
> lu.C.x_156_GROUP_3  Average    21.81  21.00  18.38  16.86  20.81  21.48  18.24  17.43
> lu.C.x_156_GROUP_4  Average    20.48  20.96  19.61  17.61  17.57  19.74  18.48  21.57
> lu.C.x_156_GROUP_5  Average    23.32  21.96  19.16  14.28  21.44  22.56  17.00  16.28
> lu.C.x_156_NORMAL_1 Average    19.50  19.83  19.58  19.25  19.58  19.42  19.42  19.42
> lu.C.x_156_NORMAL_2 Average    18.90  18.40  20.00  19.80  19.70  19.30  19.80  20.10
> lu.C.x_156_NORMAL_3 Average    19.45  19.09  19.91  20.09  19.45  18.73  19.45  19.82
> lu.C.x_156_NORMAL_4 Average    19.64  19.27  19.64  19.00  19.82  19.55  19.73  19.36
> lu.C.x_156_NORMAL_5 Average    18.75  19.42  20.08  19.67  18.75  19.50  19.92  19.92
>
> ============156_GROUP========Mop/s===================================
> min     q1      median  q3      max
> 14956.3 16346.5 17505.7 18440.6 22492.7
> ============156_GROUP========time====================================
> min     q1      median  q3      max
> 90.65   110.57  116.48  124.74  136.33
> ============156_NORMAL========Mop/s===================================
> min     q1      median  q3      max
> 29801.3 30739.2 31967.5 32151.3 34036
> ============156_NORMAL========time====================================
> min     q1      median  q3      max
> 59.91   63.42   63.78   66.33   68.42
>
>
> Note there is a significant improvement already, but we are still seeing imbalance due to
> using weighted load and averages. In this case it's only a 55% slowdown rather than
> the 5x. The overall performance of the benchmark is also much better in both cases.
>
>
>
> Here's the same test, same system with the full series (lb_v4a as I've been calling it):
>
> echo 1 >  /proc/sys/kernel/numa_balancing
> lu.C.x_156_GROUP_1  Average    18.59  19.36  19.50  18.86  20.41  20.59  18.27  20.41
> lu.C.x_156_GROUP_2  Average    19.52  20.52  20.48  21.17  19.52  19.09  17.70  18.00
> lu.C.x_156_GROUP_3  Average    20.58  20.71  20.17  20.50  18.46  19.50  18.58  17.50
> lu.C.x_156_GROUP_4  Average    18.95  19.63  19.47  19.84  18.79  19.84  20.84  18.63
> lu.C.x_156_GROUP_5  Average    16.85  17.96  19.89  19.15  19.26  20.48  21.70  20.70
> lu.C.x_156_NORMAL_1 Average    18.04  18.48  20.00  19.72  20.72  20.48  18.48  20.08
> lu.C.x_156_NORMAL_2 Average    18.22  20.56  19.50  19.39  20.67  19.83  18.44  19.39
> lu.C.x_156_NORMAL_3 Average    17.72  19.61  19.56  19.17  20.17  19.89  20.78  19.11
> lu.C.x_156_NORMAL_4 Average    18.05  19.74  20.21  19.89  20.32  20.26  19.16  18.37
> lu.C.x_156_NORMAL_5 Average    18.89  19.95  20.21  20.63  19.84  19.26  19.26  17.95
>
> ============156_GROUP========Mop/s===================================
> min     q1      median  q3      max
> 13460.1 14949   15851.7 16391.4 18993
> ============156_GROUP========time====================================
> min     q1      median  q3      max
> 107.35  124.39  128.63  136.4   151.48
> ============156_NORMAL========Mop/s===================================
> min     q1      median  q3      max
> 14418.5 18512.4 19049.5 19682   19808.8
> ============156_NORMAL========time====================================
> min     q1      median  q3      max
> 102.93  103.6   107.04  110.14  141.42
>
>
> echo 0 >  /proc/sys/kernel/numa_balancing
> lu.C.x_156_GROUP_1  Average    19.00  19.33  19.33  19.58  20.08  19.67  19.83  19.17
> lu.C.x_156_GROUP_2  Average    18.55  19.91  20.09  19.27  18.82  19.27  19.91  20.18
> lu.C.x_156_GROUP_3  Average    18.42  19.08  19.75  19.00  19.50  20.08  20.25  19.92
> lu.C.x_156_GROUP_4  Average    18.42  19.83  19.17  19.50  19.58  19.83  19.83  19.83
> lu.C.x_156_GROUP_5  Average    19.17  19.42  20.17  19.92  19.25  18.58  19.92  19.58
> lu.C.x_156_NORMAL_1 Average    19.25  19.50  19.92  18.92  19.33  19.75  19.58  19.75
> lu.C.x_156_NORMAL_2 Average    19.42  19.25  17.83  18.17  19.83  20.50  20.42  20.58
> lu.C.x_156_NORMAL_3 Average    18.58  19.33  19.75  18.25  19.42  20.25  20.08  20.33
> lu.C.x_156_NORMAL_4 Average    19.00  19.55  19.73  18.73  19.55  20.00  19.64  19.82
> lu.C.x_156_NORMAL_5 Average    19.25  19.25  19.50  18.75  19.92  19.58  19.92  19.83
>
> ============156_GROUP========Mop/s===================================
> min     q1      median  q3      max
> 28520.1 29024.2 29042.1 29367.4 31235.2
> ============156_GROUP========time====================================
> min     q1      median  q3      max
> 65.28   69.43   70.21   70.25   71.49
> ============156_NORMAL========Mop/s===================================
> min     q1      median  q3      max
> 28974.5 29806.5 30237.1 30907.4 31830.1
> ============156_NORMAL========time====================================
> min     q1      median  q3      max
> 64.06   65.97   67.43   68.41   70.37
>
>
> This all now makes sense. Looking at the numa balancing code a bit, you can see
> that it still uses load, so it will still be subject to making bogus decisions
> based on the weighted load. In this case it has been actively working against the
> load balancer because of that.

Thanks for the tests and interesting results
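
For what it's worth, a rough way to see how hard NUMA balancing is working
during such a run is to watch its counters in /proc/vmstat. A minimal sketch,
assuming CONFIG_NUMA_BALANCING (exact counter names can vary by kernel
version):

    # sample before and after a benchmark run and diff the numbers
    grep -E 'numa_(hint_faults|hint_faults_local|pages_migrated)' /proc/vmstat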

>
> I think with the three NUMA levels on this system the numa balancing was able to
> win more often. We don't see this effect to the same degree on systems with only
> one SD_NUMA level.
>
> Following the other part of this thread, I have to add that I'm of the opinion
> that the weighted load (which is all we have now, I believe) really should be used
> only in extreme cases of overload to deal with fairness. And even then maybe not.
> As far as I can see, once fair group scheduling is involved, that load is
> basically a random number between 1 and 1024. It really has no bearing on how
> much "load" a task will put on a cpu. Any comparison of that to cpu capacity
> is pretty meaningless.
>
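
As a rough illustration of that point (not a statement about the exact scaling
the kernel does): with the cgroup v1 cpu controller every group starts out with
cpu.shares = 1024, and it is that group weight, divided up and propagated down
the hierarchy, that feeds the per-task load the balancer sees, so the number
depends mostly on the cgroup layout rather than on what the task itself does.
With CONFIG_SCHED_DEBUG the group-scaled weights can be eyeballed directly
(paths below are typical but may differ by distro and kernel version):

    # root and (hypothetical) child group both default to 1024 shares
    cat /sys/fs/cgroup/cpu/cpu.shares
    cat /sys/fs/cgroup/cpu/mygroup/cpu.shares
    # per-cfs_rq / per-task weights as the scheduler sees them
    # (v5.4: /proc/sched_debug; newer: /sys/kernel/debug/sched/debug)
    grep -A 20 'cfs_rq\[0\]' /proc/sched_debug
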
> I'm sure there are workloads for which the numa balancing is more important. But
> even then I suspect it is making the wrong decisions more often than not. I think
> a similar rework may be needed :)

Yes, there is probably room for better collaboration between load balancing and
numa balancing.

>
> I've asked our perf team to try the full battery of tests with numa balancing
> disabled  to see what it shows across the board.
>
>
> Good job on this and thanks for the time looking at my specific issues.

Thanks for your help


>
>
> As far as this series is concerned, and as far as it matters:
>
> Acked-by: Phil Auld <pauld@...hat.com>
>
>
>
> Cheers,
> Phil
>
> --
>
