linux-kernel - Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAE4VaGALRsggp4Z3g01y1v=q5A=8OqGLuwOeD8oj7P-BkdWNBw@mail.gmail.com>
Date:   Thu, 12 Mar 2020 18:06:51 +0100
From:   Jirka Hladky <jhladky@...hat.com>
To:     Phil Auld <pauld@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Hillf Danton <hdanton@...a.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load
 balancer v6

> find it unlikely that is common because who acquires such a large machine
> and then uses a tiny percentage of it.

I generally agree, but I also want to make a point that AMD made these
large systems much more affordable with their EPYC CPUs. The 8 NUMA
node server we are using costs under $8k.

>
> This is somewhat of a dilemma. Without the series, the load balancer and
> NUMA balancer use very different criteria on what should happen and
> results are not stable.

Unfortunately, I see instabilities also for the series. This is again
for the sp_C test with 8 threads executed on dual-socket AMD 7351
(EPYC Naples) server with 8 NUMA nodes. With the series applied, the
runtime varies from 86 to 165 seconds! Could we do something about it?
The runtime of 86 seconds would be acceptable. If we could stabilize
this case and get consistent runtime around 80 seconds, the problem
would be gone.

Do you experience the similar instability of results on your HW for
sp_C with low thread counts?

Runtime with this series applied:
 $ grep "Time in seconds" *log
sp.C.x.defaultRun.008threads.loop01.log: Time in seconds =
      125.73
sp.C.x.defaultRun.008threads.loop02.log: Time in seconds =
       87.54
sp.C.x.defaultRun.008threads.loop03.log: Time in seconds =
       86.93
sp.C.x.defaultRun.008threads.loop04.log: Time in seconds =
      165.98
sp.C.x.defaultRun.008threads.loop05.log: Time in seconds =
      114.78

For comparison, here are vanilla kernel results:
$ grep "Time in seconds" *log
sp.C.x.defaultRun.008threads.loop01.log: Time in seconds =
       59.83
sp.C.x.defaultRun.008threads.loop02.log: Time in seconds =
       67.72
sp.C.x.defaultRun.008threads.loop03.log: Time in seconds =
       63.62
sp.C.x.defaultRun.008threads.loop04.log: Time in seconds =
       55.01
sp.C.x.defaultRun.008threads.loop05.log: Time in seconds =
       65.20

> In *general*, I found that the series won a lot more than it lost across
> a spread of workloads and machines but unfortunately it's also an area
> where counter-examples can be found.

OK, fair enough. I understand that there will always be trade-offs
when making changes to scheduler like this. And I agree that cases
with higher system load (where is series is helpful) outweigh the
performance drops for low threads counts. I was hoping that it would
be possible to improve the small threads results while keeping the
gains for other scenarios:-)  But let's be realistic - I would be
happy to fix the extreme case mentioned above. The other issues where
performance drop is about 20% are OK with me and are outweighed by the
gains for different scenarios.

Thanks again for looking into it. I know that covering all cases is
hard. I very much appreciate what you do!

Jirka


On Thu, Mar 12, 2020 at 4:56 PM Mel Gorman <mgorman@...hsingularity.net> wrote:
>
> On Thu, Mar 12, 2020 at 01:10:36PM +0100, Jirka Hladky wrote:
> > Hi Mel,
> >
> > thanks a lot for analyzing it!
> >
> > My big concern is that the performance drop for low threads counts (roughly
> > up to 2x number of NUMA nodes) is not just a rare corner case, but it might
> > be more common.
>
> That is hard to tell. In my case I was seeing the problem running a HPC
> workload, dedicated to that machine and using only 6% of available CPUs. I
> find it unlikely that is common because who acquires such a large machine
> and then uses a tiny percentage of it. Rember also the other points I made
> such as 1M migrations happening in the first few seconds just trying to
> deal with the load balancer and NUMA balancer fighting each other. While
> that might happen to be good for SP, it's not universally good behaviour
> and I've dealt with issues in the past whereby the NUMA balancer would
> get confused and just ramp up the frequency it samples and migrates trying
> to override the load balancer.
>
> > We see the drop for the following benchmarks/tests,
> > especially on 8 NUMA nodes servers. However, four and even 2 NUMA node
> > servers are affected as well.
> >
> > Numbers show average performance drop (median of runtime collected from 5
> > subsequential runs) compared to vanilla kernel.
> >
> > 2x AMD 7351 (EPYC Naples), 8 NUMA nodes
> > ===================================
> > NAS: sp_C test: -50%, peak perf. drop with 8 threads
>
> I hadn't tested 8 threads specifically I think that works out as
> using 12.5% of the available machine. The allowed imbalance between
> nodes means that some SP instances will stack on the same node but not
> the same CPU.
>
> > NAS: mg_D: -10% with 16 threads
>
> While I do not have the most up to date figures available, I found the
> opposite trend at 21 threads (the test case I used were 1/3rd available
> CPUs and as close to maximum CPUs as possible). There I found it was 10%
> faster for an 8 node machine.
>
> For 4 nodes, using a single JVM was performance neutral *on average* but
> much less variable. With one JVM per node, there was a mix of small <2%
> regressions for some thread counts and up to 9% faster on others.
>
> > SPECjvm2008: co_sunflow test: -20% (peak drop with 8 threads)
> > SPECjvm2008: compress and cr_signverify tests: -10%(peak drop with 8
> > threads)
>
> I didn't run SPECjvm for multiple thread sizes so I don't have data yet
> and may not have for another day at least.
>
> > SPECjbb2005: -10% for 16 threads
> >
>
> I found this to depend in the number of JVMs used and the thread count.
> Slower at low thread counts, faster at higher thread counts but with
> more stable results with the series applied and less NUMA balancer
> activity.
>
> This is somewhat of a dilemma. Without the series, the load balancer and
> NUMA balancer use very different criteria on what should happen and
> results are not stable. In some cases acting randomly happens to work
> out and in others it does not and overall it depends on the workload and
> machine. It's very highly dependent on both the workload and the machine
> and it's a question if we want to continue dealing with two parts of the
> scheduler disagreeing on what criteria to use or try to improve the
> reconciled load and memory balancer sharing similar logic.
>
> In *general*, I found that the series won a lot more than it lost across
> a spread of workloads and machines but unfortunately it's also an area
> where counter-examples can be found.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka