Date:   Tue, 3 Oct 2023 22:25:08 +0200 (CEST)
From:   Julia Lawall <julia.lawall@...ia.fr>
To:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Mel Gorman <mgorman@...e.de>, linux-kernel@...r.kernel.org
Subject: EEVDF and NUMA balancing

Is commit e8f331bcc270 expected to have an impact on the frequency of
NUMA balancing?

The NAS benchmark ua.C.x (NPB3.4-OMP,
https://github.com/mbdevpl/nas-parallel-benchmarks.git) on a 4-socket
Intel Xeon 6130 suffers from some NUMA moves that leave some sockets with
too few threads and other sockets with too many threads.  Prior to the
commit e8f331bcc270, this was corrected by subsequent load balancing,
leading to run times of 20-40 seconds (around 20 seconds can be achieved
if one just turns NUMA balancing off).  After commit e8f331bcc270, the
running time can go up to 150 seconds.  In the worst case, I have seen a
core remain idle for 75 seconds.  It seems that the load balancer at the
NUMA domain level is not able to do anything, because when a core on the
overloaded socket has multiple threads, they are tasks that were NUMA
balanced to the socket, and thus should not leave.  So the "busiest" core
chosen by find_busiest_queue doesn't actually contain any stealable
threads.  Maybe it would be worth stealing from a core that has only one
task in this case, in the hope that the tasks that are tied to the socket
will spread out better across it if more space is available?

An example run is attached.  The cores are renumbered according to the
sockets, so there is an overload on socket 1 and an underload on socket
2.

julia
[Attachment: "ua.C.x_yeti-2_ge8f331bcc270_performance_18_socketorder.pdf" (application/pdf, 331276 bytes)]
