linux-kernel - Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6


Message-ID: <CAE4VaGA+GOh-wgHBbSsgpRVXgrGtz8egu6dYp143TAH0siL5fA@mail.gmail.com>
Date:   Fri, 20 Mar 2020 16:33:43 +0100
From:   Jirka Hladky <jhladky@...hat.com>
To:     linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load
 balancer v6

> MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> gain on a 4-node machine for bt_C and a 3.88% regression on 8-nodes. I
> think it must be OMP you are using because I found I had to disable UA
> for MPI at some point in the past for reasons I no longer remember.

Yes, it's indeed OMP.  By low thread count, I mean up to 2x the number
of NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA
node servers).
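
For illustration only, the thread-count rule above can be written as a
small Python sketch. The function name is_low_thread_count is
hypothetical; it is not part of the NAS benchmarks or the kernel, and it
assumes "up to" means "less than or equal to".

```python
def is_low_thread_count(threads: int, numa_nodes: int) -> bool:
    """Return True when `threads` is at most 2x the number of NUMA nodes.

    Hypothetical helper: it only encodes the rule of thumb stated in the
    message (8 threads on 4 nodes, 16 threads on 8 nodes).
    """
    if threads < 1 or numa_nodes < 1:
        raise ValueError("threads and numa_nodes must be positive")
    return threads <= 2 * numa_nodes


if __name__ == "__main__":
    # The two configurations named in the message sit on the boundary.
    for threads, nodes in [(8, 4), (16, 8), (17, 8)]:
        print(threads, nodes, is_low_thread_count(threads, nodes))
```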

> One possibility would be to spread wide always at clone time and assume
> wake_affine will pull related tasks but it's fragile because it breaks
> if the cloned task execs and then allocates memory from a remote node
> only to migrate to a local node immediately.

I think the only way to find out how it performs is to test it. If you
could prepare a patch like that, I'm more than happy to give it a try!

Jirka


On Fri, Mar 20, 2020 at 4:22 PM Mel Gorman <mgorman@...hsingularity.net> wrote:
>
> On Fri, Mar 20, 2020 at 03:37:44PM +0100, Jirka Hladky wrote:
> > Hi Mel,
> >
> > just a quick update. I have increased the testing coverage, and other tests
> > from NAS show a big performance drop for low thread counts as
> > well:
> >
> > sp_C_x - still shows the biggest drop, up to 50%
> > bt_C_x - performance drop up to 40%
> > ua_C_x - performance drop up to 30%
> >
>
> MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> gain on a 4-node machine for bt_C and a 3.88% regression on 8-nodes. I
> think it must be OMP you are using because I found I had to disable UA
> for MPI at some point in the past for reasons I no longer remember.
>
> > My point is that the performance drop for low thread counts is more
> > common than we initially thought.
> >
> > Let me know if you need more data.
> >
>
> I just need a clarification on the thread count and a confirmation it's OMP. For
> MPI, I did note that some of the other NAS kernels show a slight dip but
> it was nowhere near as severe as SP and the problem was the same as before --
> two or more tasks stayed on the same node without spreading out because
> there was no pressure to do so. There was enough CPU and memory capacity
> with no obvious pattern that could be used to spread the load wide early.
>
> One possibility would be to spread wide always at clone time and assume
> wake_affine will pull related tasks but it's fragile because it breaks
> if the cloned task execs and then allocates memory from a remote node
> only to migrate to a local node immediately.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka