Message-Id: <20190605155922.17153-1-matt@codeblueprint.co.uk>
Date:   Wed,  5 Jun 2019 16:59:22 +0100
From:   Matt Fleming <matt@...eblueprint.co.uk>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org,
        Matt Fleming <matt@...eblueprint.co.uk>,
        "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        "Lendacky, Thomas" <Thomas.Lendacky@....com>,
        Borislav Petkov <bp@...en8.de>
Subject: [PATCH] sched/topology: Improve load balancing on AMD EPYC

SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea is that it's expensive to balance across
domains that far apart.
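
For reference, the relevant pre-patch logic in sd_init() looks roughly
like this (simplified sketch, surrounding code elided):

	if (sd->flags & SD_NUMA) {
		sd->flags &= ~SD_PREFER_SIBLING;
		sd->flags |= SD_SERIALIZE;

		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
			sd->flags &= ~(SD_BALANCE_EXEC |
				       SD_BALANCE_FORK |
				       SD_WAKE_AFFINE);
		}
	}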

However, as is rather unfortunately explained in

  commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.
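
For context, RECLAIM_DISTANCE has been 30 since that commit (individual
architectures may override it):

	/* include/linux/topology.h */
	#ifndef RECLAIM_DISTANCE
	#define RECLAIM_DISTANCE 30
	#endif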

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

where a 2-hop distance is 32, i.e. greater than RECLAIM_DISTANCE (30).

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.
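
The spinner program isn't part of this patch; a minimal stand-in that
just burns CPU in N threads (hypothetical, for illustration only) might
look like:

	/* spinner.c -- hypothetical test program; build with: gcc -pthread -o spinner spinner.c */
	#include <pthread.h>
	#include <stdlib.h>

	static void *spin(void *arg)
	{
		for (;;)		/* busy-loop forever */
			;
	}

	int main(int argc, char **argv)
	{
		int i, nr = argc > 1 ? atoi(argv[1]) : 1;
		pthread_t tid;

		for (i = 0; i < nr - 1; i++)
			pthread_create(&tid, NULL, spin, NULL);

		spin(NULL);		/* main thread counts as one of the N */
		return 0;
	}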

Update the code in sd_init() to account for modern node distances, while
maintaining backward-compatible behaviour by respecting RECLAIM_DISTANCE
for distances of more than 2 hops.

Signed-off-by: Matt Fleming <matt@...eblueprint.co.uk>
Cc: "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@....com>
Cc: Mel Gorman <mgorman@...hsingularity.net>
Cc: "Lendacky, Thomas" <Thomas.Lendacky@....com>
Cc: Borislav Petkov <bp@...en8.de>
---
 kernel/sched/topology.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f53f89df837d..0eea395f7c6b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1410,7 +1410,18 @@ sd_init(struct sched_domain_topology_level *tl,
 
 		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
-		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+
+		/*
+		 * Strip the following flags for sched domains with a NUMA
+		 * distance greater than the historical 2-hops value
+		 * (RECLAIM_DISTANCE) and where tl->numa_level confirms it
+		 * really is more than 2 hops.
+		 *
+		 * Respecting RECLAIM_DISTANCE means we maintain
+		 * backwards-compatible behaviour.
+		 */
+		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE &&
+		    tl->numa_level > 3) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
 				       SD_WAKE_AFFINE);
-- 
2.13.7
