Message-ID: <95039c1a-7839-d758-e882-1baaf1337960@amd.com>
Date: Thu, 14 Sep 2017 09:12:22 -0700
From: Suravee Suthikulpanit <Suravee.Suthikulpanit@....com>
To: linux-kernel@...r.kernel.org
Cc: mingo@...hat.com, peterz@...radead.org, bp@...e.de
Subject: Re: [PATCH v3] sched/topology: Introduce NUMA identity node sched domain

Hi,

Are there any other concerns with this patch?

Thanks,
Suravee

On 9/7/17 00:20, Suravee Suthikulpanit wrote:
> On an AMD Family17h-based (EPYC) system, a logical NUMA node can contain
> up to 8 cores (16 threads) with the following topology:
>
>              ----------------------------
>          C0  | T0 T1 |    ||    | T0 T1 | C4
>              --------|    ||    |--------
>          C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>              --------|    ||    |--------
>          C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>              --------|    ||    |--------
>          C3  | T0 T1 |    ||    | T0 T1 | C7
>              ----------------------------
>
> Here, there are 2 last-level (L3) caches per logical NUMA node.
> A socket can contain up to 4 NUMA nodes, and a system can support
> up to 2 sockets. With a full system configuration, the current scheduler
> creates 4 sched domains:
>
>   domain0 SMT  (span a core)
>   domain1 MC   (span a last-level-cache)
>   domain2 NUMA (span a socket: 4 nodes)
>   domain3 NUMA (span a system: 8 nodes)
>
> Note that there is no domain to represent cpus spanning a logical
> NUMA node. With this hierarchy of sched domains, the scheduler does
> not balance properly in the following cases:
>
> Case 1:
>   When running 8 tasks, a properly balanced system should
>   schedule a task per logical NUMA node. This is not the case with
>   the current scheduler.
>
> Case 2:
>   In some cases, threads are scheduled on the same cpu while other
>   cpus are idle. This results in run-to-run inconsistency. For example:
>
>     taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
>                             --cpu-max-prime=100000 run
>
>   Total execution time ranges from 25.1s to 33.5s depending on thread
>   placement, where 25.1s is when all 8 threads are balanced properly
>   on 8 cpus.
>
> Introduce a NUMA identity node sched domain, based on how the SRAT/SLIT
> tables define a logical NUMA node. This results in the following
> hierarchy of sched domains on the same system described above:
>
>   domain0 SMT  (span a core)
>   domain1 MC   (span a last-level-cache)
>   domain2 NODE (span a logical NUMA node)
>   domain3 NUMA (span a socket: 4 nodes)
>   domain4 NUMA (span a system: 8 nodes)
>
> This fixes the improper load balancing cases mentioned above.
>
> Note that in case the cpumasks of the last-level-cache and NODE domains
> are the same (e.g. on AMD Family10h/15h servers), the NODE domain
> will be excluded. Therefore, this change will not affect those systems.
>
> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@....com>
> ---
>  kernel/sched/topology.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895ae..98a8bbc 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
>  	if (!sched_domains_numa_distance)
>  		return;
>
> +	/* Includes NUMA identity node at level 0. */
> +	sched_domains_numa_distance[level++] = curr_distance;
> +	sched_domains_numa_levels = level;
> +
>  	/*
>  	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>  	 * unique distances in the node_distance() table.
> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
>  		return;
>
>  	/*
> -	 * 'level' contains the number of unique distances, excluding the
> -	 * identity distance node_distance(i,i).
> +	 * 'level' contains the number of unique distances
>  	 *
>  	 * The sched_domains_numa_distance[] array includes the actual distance
>  	 * numbers.
> @@ -1445,9 +1448,26 @@ void sched_init_numa(void)
>  		tl[i] = sched_domain_topology[i];
>
>  	/*
> +	 * Do not setup NUMA node level if it has the same cpumask
> +	 * as sched domain at previous level. This is the case for
> +	 * system with:
> +	 *   LLC == NODE : LLC (MC) sched domain span a NUMA node.
> +	 *   DIE == NODE : DIE sched domain span a NUMA node.
> +	 *
> +	 * Assume all NUMA nodes are identical, so only check node 0.
> +	 */
> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
> +		tl[i++] = (struct sched_domain_topology_level){
> +			.mask = sd_numa_mask,
> +			.numa_level = 0,
> +			SD_INIT_NAME(NODE)
> +		};
> +	}
> +
> +	/*
>  	 * .. and append 'j' levels of NUMA goodness.
>  	 */
> -	for (j = 0; j < level; i++, j++) {
> +	for (j = 1; j < level; i++, j++) {
>  		tl[i] = (struct sched_domain_topology_level){
>  			.mask = sd_numa_mask,
>  			.sd_flags = cpu_numa_flags,
>
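For readers who want to check the resulting hierarchy on a live machine, here is a minimal
userspace sketch (not part of the patch). It assumes CONFIG_SCHED_DEBUG=y, under which kernels
of this generation expose each CPU's sched domains under /proc/sys/kernel/sched_domain/; with
the patch applied on a fully populated EPYC system, cpu0 should list SMT, MC, NODE, NUMA, NUMA.

    /*
     * Sketch: print the name of each sched domain level for cpu0.
     * Assumes CONFIG_SCHED_DEBUG=y and the procfs layout
     * /proc/sys/kernel/sched_domain/cpuN/domainM/name.
     */
    #include <stdio.h>
    #include <glob.h>

    int main(void)
    {
            glob_t g;
            size_t i;

            /* One domainM directory per level of the hierarchy. */
            if (glob("/proc/sys/kernel/sched_domain/cpu0/domain*/name", 0, NULL, &g)) {
                    fprintf(stderr, "no sched_domain entries (CONFIG_SCHED_DEBUG?)\n");
                    return 1;
            }

            for (i = 0; i < g.gl_pathc; i++) {
                    char name[64] = "";
                    FILE *f = fopen(g.gl_pathv[i], "r");

                    if (!f)
                            continue;
                    if (fgets(name, sizeof(name), f))
                            printf("domain%zu: %s", i, name); /* name already ends in '\n' */
                    fclose(f);
            }

            globfree(&g);
            return 0;
    }

On an unpatched kernel the NODE level is absent, so the same program prints one fewer line;
that difference is the quickest way to see whether the identity node domain was created or
collapsed into MC when LLC == NODE.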