linux-kernel - Re: [PATCH 14/14] sched/topology: Add a few comments

Message-ID: <20170501083507.hccopl7rxtmrfsrd@hirez.programming.kicks-ass.net>
Date:   Mon, 1 May 2017 10:35:07 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     mingo@...nel.org, lvenanci@...hat.com
Cc:     lwang@...hat.com, riel@...hat.com, efault@....de,
        tglx@...utronix.de, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 14/14] sched/topology: Add a few comments

On Fri, Apr 28, 2017 at 03:20:12PM +0200, Peter Zijlstra wrote:
> +/*
> + * NUMA topology (first read the regular topology blurb below)
> + *
> + * Given a node-distance table, for example:
> + *
> + *   node   0   1   2   3
> + *     0:  10  20  30  20
> + *     1:  20  10  20  30
> + *     2:  30  20  10  20
> + *     3:  20  30  20  10
> + *
> + * which represents a 4 node ring topology like:
> + *
> + *   0 ----- 1
> + *   |       |
> + *   |       |
> + *   |       |
> + *   3 ----- 2
> + *
> + * We want to construct domains and groups to represent this. The way we go
> + * about doing this is to build the domains on 'hops'. For each NUMA level we
> + * construct the mask of all nodes reachable in @level hops.
> + *
> + * For the above NUMA topology that gives 3 levels:
> + *
> + * NUMA-2	0-3		0-3		0-3		0-3
> + *  groups:	{0-1,3},{1-3}	{0-2},{0,2-3}	{1-3},{0-1,3}	{0,2-3},{0-2}
> + *
> + * NUMA-1	0-1,3		0-2		1-3		0,2-3
> + *  groups:	{0},{1},{3}	{0},{1},{2}	{1},{2},{3}	{0},{2},{3}
> + *
> + * NUMA-0	0		1		2		3
> + *
> + *
> + * As can be seen, things don't nicely line up as with the regular topology.
> + * When we iterate a domain in child domain chunks some nodes can be
> + * represented multiple times -- hence the "overlap" naming for this part of
> + * the topology.
> + *
> + * In order to minimize this overlap, we only build enough groups to cover the
> + * domain. For instance Node-0 NUMA-2 would only get groups: 0-1,3 and 1-3.
> + *
> + * Because:
> + *
> + *  - the first group of each domain is its child domain; this
> + *    gets us the first 0-1,3
> + *  - the only uncovered node is 2, whose child domain is 1-3.
> + *
> + * However, because of the overlap, computing a unique CPU for each group is
> + * more complicated. Consider for instance the groups of Node-1 NUMA-2: both
> + * groups include the CPUs of Node-0, while those CPUs would never in fact
> + * end up at those groups (they would end up in group: 0-1,3).
> + *
> + * To correct this we have to introduce the group iteration mask. This mask
> + * will contain those CPUs in the group that can reach this group given the
> + * (child) domain tree.
> + *
> + * With this we can once again compute balance_cpu and sched_group_capacity
> + * relations.
> + *
> + * XXX include words on how balance_cpu is unique and therefore can be
> + * used for sched_group_capacity links.
> + */
> +

I added the below to clarify some of the comments we have about asymmetric
topologies.


--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -571,6 +571,37 @@ int group_balance_cpu(struct sched_group
  *
  * XXX include words on how balance_cpu is unique and therefore can be
  * used for sched_group_capacity links.
+ *
+ *
+ * Another 'interesting' topology is:
+ *
+ *   node   0   1   2   3
+ *     0:  10  20  20  30
+ *     1:  20  10  20  20
+ *     2:  20  20  10  20
+ *     3:  30  20  20  10
+ *
+ * Which looks a little like:
+ *
+ *   0 ----- 1
+ *   |     / |
+ *   |   /   |
+ *   | /     |
+ *   2 ----- 3
+ *
+ * This topology is asymmetric: nodes 1,2 are fully connected, but nodes 0,3
+ * are not.
+ *
+ * This leads to a few particularly weird cases where the number of
+ * sched_domains is not the same for each CPU. Consider:
+ *
+ * NUMA-2	0-3						0-3
+ *  groups:	{0-2},{1-3}					{1-3},{0-2}
+ *
+ * NUMA-1	0-2		0-3		0-3		1-3
+ *
+ * NUMA-0	0		1		2		3
+ *
  */