linux-kernel - [PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour


Message-Id: <20220511143038.4620-1-mgorman@techsingularity.net>
Date:   Wed, 11 May 2022 15:30:34 +0100
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Aubrey Li <aubrey.li@...ux.intel.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Mel Gorman <mgorman@...hsingularity.net>
Subject: [PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour

A problem was reported privately related to inconsistent performance of
NAS when parallelised with MPICH. The root of the problem is that the
initial placement is unpredictable and there can be a larger imbalance
than expected between NUMA nodes. As there is spare capacity and the faults
are local, the imbalance persists for a long time and performance suffers.

This is not 100% an "allowed imbalance" problem as setting the allowed
imbalance to 0 does not fix the issue but the allowed imbalance contributes
to the performance problem. The unpredictable behaviour was most recently
introduced by commit c6f886546cb8 ("sched/fair: Trigger the update of
blocked load on newly idle cpu").

With MPICH, mpirun forks hydra_pmi_proxy helpers that go to sleep before
execing the target workload. As the new tasks are sleeping, the potential
imbalance is not observed as idle_cpus does not reflect the tasks that
will be running in the near future. How bad the problem is depends on the
timing of when fork happens and whether the new tasks are still running.
Consequently, a large initial imbalance may not be detected until the
workload is fully running. Once running, NUMA Balancing picks the preferred
node based on locality and runtime load balancing often ignores the tasks
as can_migrate_task() fails for either locality or task_hot reasons and
instead picks unrelated tasks.
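
To illustrate the placement effect described above, here is a toy model
in Python. It is not the kernel's placement logic: the function name, the
two-node layout and the "pick the node with more idle CPUs, ties go to
the local node" rule are all assumptions made for illustration. The only
point it shows is that tasks which sleep right after fork are invisible
to an idle-CPU count, so every fork sees the same picture.

```python
def place_forked_tasks(n_tasks, cpus_per_node, tasks_sleep_after_fork):
    """Toy fork-time placement across two NUMA nodes (hypothetical model).

    Returns [tasks_on_node0, tasks_on_node1]. Node 0 is the parent's node.
    """
    placed = [0, 0]    # tasks actually placed on each node
    visible = [0, 0]   # tasks the placer can observe as running

    for _ in range(n_tasks):
        # Idle CPUs as seen at fork time; sleeping tasks are not counted.
        idle = [cpus_per_node - visible[0], cpus_per_node - visible[1]]
        # Assumed rule: prefer the node with more idle CPUs, ties stay local.
        node = 0 if idle[0] >= idle[1] else 1
        placed[node] += 1
        if not tasks_sleep_after_fork:
            visible[node] += 1

    return placed


# 16 tasks on a 2-socket machine with 40 CPUs per node.
print(place_forked_tasks(16, 40, tasks_sleep_after_fork=True))
print(place_forked_tasks(16, 40, tasks_sleep_after_fork=False))
```

In this model the sleeping case stacks all 16 tasks on the local node,
while the running case spreads them evenly. Real behaviour sits somewhere
in between and depends on timing, as noted above.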

This is the min, max and range of run time for mg.D parallelised by MPICH
using ~25% of the CPUs on a 2-socket machine (80 CPUs, 16 active for mg.D
due to limitations of mg.D).

v5.3                         Min  95.84 Max  96.55 Range   0.71 Mean  96.16
v5.7                         Min  95.44 Max  96.51 Range   1.07 Mean  96.14
v5.8                         Min  96.02 Max 197.08 Range 101.06 Mean 154.70
v5.12                        Min 104.45 Max 111.03 Range   6.58 Mean 105.94
v5.13                        Min 104.38 Max 170.37 Range  65.99 Mean 117.35
v5.13-revert-c6f886546cb8    Min 104.40 Max 110.70 Range   6.30 Mean 105.68 
v5.18rc4-baseline            Min 104.46 Max 169.04 Range  64.58 Mean 130.49
v5.18rc4-revert-c6f886546cb8 Min 113.98 Max 117.29 Range   3.31 Mean 114.71
v5.18rc4-this_series         Min  95.24 Max 175.33 Range  80.09 Mean 108.91
v5.18rc4-this_series+revert  Min  95.24 Max  99.87 Range   4.63 Mean  96.54
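
For reference, the Range column is simply Max minus Min. The short Python
sketch below recomputes it from the Min and Max values in the table and
sorts the kernels from most to least stable. The dictionary and helper
names are made up for this example.

```python
# (min, max) run times copied from the table above.
RESULTS = {
    "v5.3": (95.84, 96.55),
    "v5.7": (95.44, 96.51),
    "v5.8": (96.02, 197.08),
    "v5.12": (104.45, 111.03),
    "v5.13": (104.38, 170.37),
    "v5.13-revert-c6f886546cb8": (104.40, 110.70),
    "v5.18rc4-baseline": (104.46, 169.04),
    "v5.18rc4-revert-c6f886546cb8": (113.98, 117.29),
    "v5.18rc4-this_series": (95.24, 175.33),
    "v5.18rc4-this_series+revert": (95.24, 99.87),
}


def run_time_range(min_time, max_time):
    """Spread between the slowest and fastest run, rounded to 2 places."""
    return round(max_time - min_time, 2)


def kernels_by_stability(results):
    """Kernel names ordered from smallest to largest run-time range."""
    return sorted(results, key=lambda name: run_time_range(*results[name]))


for name in kernels_by_stability(RESULTS):
    print(f"{name:30s} range {run_time_range(*RESULTS[name]):7.2f}")
```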

This shows that we've had unpredictable performance for a long time for
this load. Instability was introduced somewhere between v5.7 and v5.8,
fixed in v5.12 and broken again since v5.13.  The revert against 5.13
and 5.18-rc4 shows that c6f886546cb8 is the primary source of instability
although the best case is still worse than 5.7.

This series addresses the allowed imbalance problems to get the peak
performance back to 5.7 although only some of the time due to the
instability problem. The series plus the revert is both stable and has
slightly better peak performance and similar average performance. I'm
not convinced commit c6f886546cb8 is wrong but haven't isolated exactly
why it's unstable so for now, I'm just noting it has an issue.

Patch 1 initialises numa_migrate_retry. While this resolves itself
	eventually, it is unpredictable early in the lifetime of
	a task.

Patch 2 will not swap NUMA tasks in the same NUMA group or without
	a NUMA group if there is spare capacity. Swapping is just
	punishing one task to help another.

Patch 3 fixes an issue where a larger imbalance can be created at
	fork time than would be allowed at run time. This behaviour
	can help some workloads that are short lived and prefer
	to remain local but it punishes long-lived tasks that are
	memory intensive.

Patch 4 adjusts the threshold where a NUMA imbalance is allowed to
	better approximate the number of memory channels, at least
	for x86-64.
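
The idea behind patches 3 and 4 can be sketched with a toy model in
Python. This is not the code in the series: every function name, the
fallback rule and the example threshold are assumptions. It only shows
one way to keep fork time and run time consistent, namely by having both
paths consult the same "is an imbalance tolerated" predicate with the
same threshold.

```python
def imbalance_allowed(dst_running, imb_numa_nr):
    """Hypothetical predicate: tolerate a NUMA imbalance only while the
    destination node runs at most imb_numa_nr tasks."""
    return dst_running <= imb_numa_nr


def fork_time_node(local_running, remote_running, imb_numa_nr):
    """Pick a node for a newly forked task (toy rule).

    Stay local while the shared predicate tolerates the resulting load,
    otherwise fall back to the less loaded node.
    """
    if imbalance_allowed(local_running + 1, imb_numa_nr):
        return "local"
    return "local" if local_running <= remote_running else "remote"


def run_time_imbalance(imbalance, dst_running, imb_numa_nr):
    """Imbalance the run-time balancer acts on (toy rule).

    While the shared predicate tolerates it, report no imbalance.
    """
    if imbalance_allowed(dst_running, imb_numa_nr):
        return 0
    return imbalance


# With an assumed threshold of 4 tasks per node:
print(fork_time_node(local_running=0, remote_running=0, imb_numa_nr=4))
print(fork_time_node(local_running=4, remote_running=0, imb_numa_nr=4))
print(run_time_imbalance(imbalance=3, dst_running=2, imb_numa_nr=4))
print(run_time_imbalance(imbalance=3, dst_running=6, imb_numa_nr=4))
```

Because both helpers share one predicate, fork cannot build up a larger
imbalance than the run-time path would leave alone. Changing the value
passed as imb_numa_nr moves the threshold for both paths at once.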

 kernel/sched/fair.c     | 59 ++++++++++++++++++++++++++---------------
 kernel/sched/topology.c | 23 ++++++++++------
 2 files changed, 53 insertions(+), 29 deletions(-)

-- 
2.34.1