Date:   Tue, 3 Oct 2023 22:25:08 +0200 (CEST)
From:   Julia Lawall <julia.lawall@...ia.fr>
To:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Mel Gorman <mgorman@...e.de>, linux-kernel@...r.kernel.org
Subject: EEVDF and NUMA balancing

Is commit e8f331bcc270 expected to have an impact on the frequency of
NUMA balancing?

The NAS benchmark ua.C.x (NPB3.4-OMP,
https://github.com/mbdevpl/nas-parallel-benchmarks.git) on a 4-socket
Intel Xeon 6130 suffers from some NUMA moves that leave some sockets with
too few threads and other sockets with too many threads.  Prior to the
commit e8f331bcc270, this was corrected by subsequent load balancing,
leading to run times of 20-40 seconds (around 20 seconds can be achieved
if one just turns NUMA balancing off).  After commit e8f331bcc270, the
running time can go up to 150 seconds.  In the worst case, I have seen a
core remain idle for 75 seconds.  It seems that the load balancer at the
NUMA domain level is not able to do anything, because when a core on the
overloaded socket has multiple threads, they are tasks that were NUMA
balanced to the socket, and thus should not leave.  So the "busiest" core
chosen by find_busiest_queue doesn't actually contain any stealable
threads.  Maybe it would be worth stealing from a core that has only one
task in this case, in the hope that the tasks that are tied to the socket
will spread out better across it if more space is available?

An example run is attached.  The cores are renumbered according to the
sockets, so there is an overload on socket 1 and an underload on socket
2.

julia
[Attachment: "ua.C.x_yeti-2_ge8f331bcc270_performance_18_socketorder.pdf" (application/pdf, 331276 bytes)]
