Date:   Thu, 14 Sep 2017 09:12:22 -0700
From:   Suravee Suthikulpanit <Suravee.Suthikulpanit@....com>
To:     linux-kernel@...r.kernel.org
Cc:     mingo@...hat.com, peterz@...radead.org, bp@...e.de
Subject: Re: [PATCH v3] sched/topology: Introduce NUMA identity node sched
 domain

Hi,

Are there any other concerns with this patch?

Thanks,
Suravee

On 9/7/17 00:20, Suravee Suthikulpanit wrote:
> On an AMD Family17h-based (EPYC) system, a logical NUMA node can contain
> up to 8 cores (16 threads) with the following topology:
>
>              ----------------------------
>          C0  | T0 T1 |    ||    | T0 T1 | C4
>              --------|    ||    |--------
>          C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>              --------|    ||    |--------
>          C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>              --------|    ||    |--------
>          C3  | T0 T1 |    ||    | T0 T1 | C7
>              ----------------------------
>
> Here, there are 2 last-level (L3) caches per logical NUMA node.
> A socket can contain up to 4 NUMA nodes, and a system can support
> up to 2 sockets. With a full system configuration, the current scheduler
> creates 4 sched domains:
>
>   domain0 SMT       (span a core)
>   domain1 MC        (span a last-level-cache)
>   domain2 NUMA      (span a socket: 4 nodes)
>   domain3 NUMA      (span a system: 8 nodes)
>
> Note that there is no domain to represent the cpus spanning a logical
> NUMA node.  With this hierarchy of sched domains, the scheduler does
> not balance properly in the following cases:
>
> Case1:
> When running 8 tasks, a properly balanced system should
> schedule a task per logical NUMA node. This is not the case for
> the current scheduler.
>
> Case2:
> In some cases, threads are scheduled on the same cpu, while other
> cpus are idle. This results in run-to-run inconsistency. For example:
>
>   taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
>                           --cpu-max-prime=100000 run
>
> Total execution time ranges from 25.1s to 33.5s depending on thread
> placement, where 25.1s is when all 8 threads are balanced properly
> on 8 cpus.
>
> Introduce a NUMA identity node sched domain, which is based on how the
> SRAT/SLIT tables define a logical NUMA node. This results in the following
> hierarchy of sched domains on the same system described above:
>
>   domain0 SMT       (span a core)
>   domain1 MC        (span a last-level-cache)
>   domain2 NODE      (span a logical NUMA node)
>   domain3 NUMA      (span a socket: 4 nodes)
>   domain4 NUMA      (span a system: 8 nodes)
>
> This fixes the improper load balancing cases mentioned above.
>
> Note that if the cpumasks of the last-level-cache and NODE domains
> are the same (e.g. on AMD family10h/15h servers), the NODE domain
> is excluded. Therefore, this change does not affect those systems.
>
> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@....com>
> ---
>  kernel/sched/topology.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895ae..98a8bbc 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
>  	if (!sched_domains_numa_distance)
>  		return;
>
> +	/* Includes NUMA identity node at level 0. */
> +	sched_domains_numa_distance[level++] = curr_distance;
> +	sched_domains_numa_levels = level;
> +
>  	/*
>  	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>  	 * unique distances in the node_distance() table.
> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
>  		return;
>
>  	/*
> -	 * 'level' contains the number of unique distances, excluding the
> -	 * identity distance node_distance(i,i).
> +	 * 'level' contains the number of unique distances
>  	 *
>  	 * The sched_domains_numa_distance[] array includes the actual distance
>  	 * numbers.
> @@ -1445,9 +1448,26 @@ void sched_init_numa(void)
>  		tl[i] = sched_domain_topology[i];
>
>  	/*
> +	 * Do not setup NUMA node level if it has the same cpumask
> +	 * as sched domain at previous level. This is the case for
> +	 * system with:
> +	 *  LLC == NODE : LLC (MC) sched domain span a NUMA node.
> +	 *  DIE == NODE : DIE sched domain span a NUMA node.
> +	 *
> +	 * Assume all NUMA nodes are identical, so only check node 0.
> +	 */
> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
> +		tl[i++] = (struct sched_domain_topology_level){
> +			.mask = sd_numa_mask,
> +			.numa_level = 0,
> +			SD_INIT_NAME(NODE)
> +		};
> +	}
> +
> +	/*
>  	 * .. and append 'j' levels of NUMA goodness.
>  	 */
> -	for (j = 0; j < level; i++, j++) {
> +	for (j = 1; j < level; i++, j++) {
>  		tl[i] = (struct sched_domain_topology_level){
>  			.mask = sd_numa_mask,
>  			.sd_flags = cpu_numa_flags,
>
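
The level-selection logic in the last hunk can be sketched outside the
kernel as follows. This is a minimal userspace illustration, not the
kernel implementation: the toy_* names and TOY_MAX_LEVELS are
hypothetical, a 64-bit integer stands in for a cpumask, and only the
spans seen from CPU 0 are considered (matching the patch's assumption
that all NUMA nodes are identical). numa_span[0] plays the role of the
identity level (the CPUs of a node's own logical NUMA node), and it is
added as a NODE level only when it differs from the previous level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, at most 64 CPUs. */
typedef uint64_t toy_mask_t;

#define TOY_MAX_LEVELS 8

struct toy_level {
	const char *name;
	toy_mask_t span;	/* CPUs covered by this level, seen from CPU 0 */
};

/*
 * Copy the base levels (e.g. SMT, MC) into out[], then append the NUMA
 * levels. numa_span[0] is the identity level; numa_span[1..] are the
 * progressively wider NUMA levels. The caller must provide room for
 * nr_base + nr_numa entries in out[].
 *
 * Returns the number of levels written to out[].
 */
static size_t toy_build_levels(const struct toy_level *base, size_t nr_base,
			       const toy_mask_t *numa_span, size_t nr_numa,
			       struct toy_level *out)
{
	size_t i, j;

	assert(nr_base >= 1 && nr_numa >= 1);
	assert(nr_base + nr_numa <= TOY_MAX_LEVELS);

	for (i = 0; i < nr_base; i++)
		out[i] = base[i];

	/* Skip NODE when it would span the same CPUs as the previous level. */
	if (numa_span[0] != out[i - 1].span)
		out[i++] = (struct toy_level){ .name = "NODE",
					       .span = numa_span[0] };

	/* Append the remaining NUMA levels, starting after the identity. */
	for (j = 1; j < nr_numa; i++, j++)
		out[i] = (struct toy_level){ .name = "NUMA",
					     .span = numa_span[j] };

	return i;
}
```

With base levels SMT (2 CPUs) and MC (8 CPUs) and NUMA spans of 16 and
64 CPUs, this yields four levels: SMT, MC, NODE, NUMA. If the MC level
already spans the 16 CPUs of the node, the NODE level is skipped and
three levels remain.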
