lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 10 Aug 2017 10:20:52 -0500
From:   Suravee Suthikulpanit <suravee.suthikulpanit@....com>
To:     linux-kernel@...r.kernel.org
Cc:     mingo@...hat.com, peterz@...radead.org, bp@...e.de,
        Suravee Suthikulpanit <suravee.suthikulpanit@....com>
Subject: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

On AMD Family17h-based (EPYC) system, a NUMA node can contain
upto 8 cores (16 threads) with the following topology.

             ----------------------------
         C0  | T0 T1 |    ||    | T0 T1 | C4
             --------|    ||    |--------
         C1  | T0 T1 | L3 || L3 | T0 T1 | C5
             --------|    ||    |--------
         C2  | T0 T1 | #0 || #1 | T0 T1 | C6
             --------|    ||    |--------
         C3  | T0 T1 |    ||    | T0 T1 | C7
             ----------------------------

Here, there are 2 last-level (L3) caches per NUMA node. A socket can
contain upto 4 NUMA nodes, and a system can support upto 2 sockets.
With full system configuration, current scheduler creates 4 sched
domains:

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NUMA      (span a socket: 4 nodes)
  domain3 NUMA      (span a system: 8 nodes)

Note that there is no domain to represent cpus spaning a NUMA node.
With this hierachy of sched domains, the scheduler does not balance
properly in the following cases:

Case1:
When running 8 tasks, a properly balanced system should
schedule a task per NUMA node. This is not the case for
the current scheduler.

Case2:
When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
a properly balanced system should schedule 8 threads on 8 cpus
(e.g. T0 of C0-C7).  However, current scheduler would schedule
some threads on the same cpu, while others are idle.

Introducing NUMA identity node sched domain, which is based on how
SRAT/SLIT table define a NUMA node. This results in the following
hierachy of sched domains on the same system described above.

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NUMA_IDEN (span a NUMA node)
  domain3 NUMA      (span a socket: 4 nodes)
  domain4 NUMA      (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@....com>
---
 kernel/sched/topology.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79895ae..c57df98 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1335,6 +1335,10 @@ void sched_init_numa(void)
 	if (!sched_domains_numa_distance)
 		return;
 
+	/* Includes NUMA identity node at level 0. */
+	sched_domains_numa_distance[level++] = curr_distance;
+	sched_domains_numa_levels = level;
+
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
@@ -1382,8 +1386,7 @@ void sched_init_numa(void)
 		return;
 
 	/*
-	 * 'level' contains the number of unique distances, excluding the
-	 * identity distance node_distance(i,i).
+	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
 	 * numbers.
@@ -1445,9 +1448,24 @@ void sched_init_numa(void)
 		tl[i] = sched_domain_topology[i];
 
 	/*
+	 * Ignore the NUMA identity level if it has the same cpumask
+	 * as previous level. This is the case for:
+	 *   - System with last-level-cache (MC) sched domain span a NUMA node.
+	 *   - System with DIE sched domain span a NUMA node.
+	 *
+	 * Assume all NUMA nodes are identical, so only check node 0.
+	 */
+	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
+		tl[i++] = (struct sched_domain_topology_level){
+			.mask = sd_numa_mask,
+			.numa_level = 0,
+			SD_INIT_NAME(NUMA_IDEN)
+		};
+
+	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
-	for (j = 0; j < level; i++, j++) {
+	for (j = 1; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
 			.sd_flags = cpu_numa_flags,
-- 
2.7.4

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ