Date:   Mon, 25 Jan 2021 21:55:40 +0000
From:   "Song Bao Hua (Barry Song)" <song.bao.hua@...ilicon.com>
To:     Valentin Schneider <valentin.schneider@....com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Mel Gorman <mgorman@...e.de>
CC:     Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "linuxarm@...neuler.org" <linuxarm@...neuler.org>
Subject: RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of
 NUMA diameter > 2



> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@....com]
> Sent: Tuesday, January 26, 2021 1:11 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@...ilicon.com>; Vincent Guittot
> <vincent.guittot@...aro.org>; Mel Gorman <mgorman@...e.de>
> Cc: Ingo Molnar <mingo@...nel.org>; Peter Zijlstra <peterz@...radead.org>;
> Dietmar Eggemann <dietmar.eggemann@....com>; Morten Rasmussen
> <morten.rasmussen@....com>; linux-kernel <linux-kernel@...r.kernel.org>;
> linuxarm@...neuler.org
> Subject: RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact
> of NUMA diameter > 2
> 
> On 25/01/21 03:13, Song Bao Hua (Barry Song) wrote:
> > As long as the NUMA diameter is > 2, building a sched_domain from a
> > sibling's child domain will definitely create a sched_domain whose
> > sched_group spans outside that sched_domain.
> >                +------+         +------+        +-------+       +------+
> >                | node |  12     |node  | 20     | node  |  12   |node  |
> >                |  0   +---------+1     +--------+ 2     +-------+3     |
> >                +------+         +------+        +-------+       +------+
> >
> > domain0        node0            node1            node2          node3
> >
> > domain1        node0+1          node0+1          node2+3        node2+3
> >                                                  +
> > domain2        node0+1+2                         |
> >              group: node0+1                      |
> >                group:node2+3 <-------------------+
> >
> > When node2 is added into the domain2 of node0, the kernel uses the child
> > domain of node2's domain2, which is domain1 (node2+3). Node 3 is outside
> > the span of node0+1+2.
> >
> > Will we move to use the *child* domain of the *child* domain of node2's
> > domain2 to build the sched_group?
> >
> > I mean:
> >                +------+         +------+        +-------+       +------+
> >                | node |  12     |node  | 20     | node  |  12   |node  |
> >                |  0   +---------+1     +--------+ 2     +-------+3     |
> >                +------+         +------+        +-------+       +------+
> >
> > domain0        node0            node1          +- node2          node3
> >                                                |
> > domain1        node0+1          node0+1        | node2+3        node2+3
> >                                                |
> > domain2        node0+1+2                       |
> >              group: node0+1                    |
> >                group:node2 <-------------------+
> >
> > In this way, it seems we don't have to create a new group as we are just
> > reusing the existing group?
> >
> 
> One thing I've been musing over is pretty much this; that is to say we
> would make all non-local NUMA sched_groups span a single node. This would
> let us reuse an existing span+sched_group_capacity: the local group of that
> node at its first NUMA topology level.
> 
> Essentially this means getting rid of the overlapping groups, and the
> balance mask is handled the same way as for !NUMA, i.e. it's the local
> group span. I've not gone far enough through the thought experiment to see
> where it miserably falls apart... It is at the very least violating the
> expectation that a group's span is a child domain's span - here it can be a
> grand^x child domain's span.
> 
> 
> If we take your topology, we currently have:
> 
> | tl\node | 0            | 1             | 2             | 3            |
> |---------+--------------+---------------+---------------+--------------|
> | NUMA0   | (0)->(1)     | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)     |
> | NUMA1   | (0-1)->(1-3) | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(0-2) |
> | NUMA2   | (0-2)->(1-3) | N/A           | N/A           | (1-3)->(0-2) |
> 
> With the current overlapping group scheme, we would need to make it look
> like so:
> 
> | tl\node | 0             | 1             | 2             | 3             |
> |---------+---------------+---------------+---------------+---------------|
> | NUMA0   | (0)->(1)      | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)      |
> | NUMA1   | (0-1)->(1-2)* | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(1-2)* |
> | NUMA2   | (0-2)->(1-3)  | N/A           | N/A           | (1-3)->(0-2)  |
> 
> But as already discussed, that's tricky to make work. With the node-span
> groups thing, we would turn this into:
> 
> | tl\node | 0          | 1             | 2             | 3          |
> |---------+------------+---------------+---------------+------------|
> | NUMA0   | (0)->(1)   | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)   |
> | NUMA1   | (0-1)->(2) | (0-2)->(3)    | (1-3)->(0)    | (2-3)->(1) |
> | NUMA2   | (0-2)->(3) | N/A           | N/A           | (1-3)->(0) |

Actually I didn't mean going that far. What I was thinking is that we only
fix a sched_domain when its sched_group isn't a subset of that
sched_domain's span. Those sched_domains which don't have the group-span
issue are left untouched. So for NUMA1 we change things as in your diagram,
but NUMA2 isn't changed. The concept is something like:

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1040,6 +1040,19 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
                }

                sg_span = sched_group_span(sg);
+#if 1
+               if (sibling->child && !cpumask_subset(sg_span, span)) {
+                       sg = build_group_from_child_sched_domain(sibling->child, cpu);
+                       ...
+                       sg_span = sched_group_span(sg);
+               }
+#endif
                cpumask_or(covered, covered, sg_span);
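
To make the intent of that check concrete, here is a minimal user-space
sketch (plainly not kernel code) of the same idea: when the group span built
from a sibling's child domain leaks outside the parent domain's span, keep
descending through the sibling's child domains until the span fits. The
bitmask representation, the node2_domains[] table and the pick_group_span()
helper are invented for this illustration; in the kernel the check would be
the cpumask_subset() test in the diff above.

/*
 * Toy user-space sketch of the subset check above -- not kernel code.
 * One bit per NUMA node; node2_domains[] and pick_group_span() are made
 * up for this illustration.
 *
 * Topology from the diagram:  node0 --12-- node1 --20-- node2 --12-- node3
 */
#include <stdio.h>
#include <stdint.h>

typedef uint32_t mask_t;			/* one bit per NUMA node */

static int is_subset(mask_t a, mask_t b)	/* like cpumask_subset(a, b) */
{
	return (a & ~b) == 0;
}

/* node2's domain spans, innermost first: domain0 = {2}, domain1 = {2,3} */
static const mask_t node2_domains[] = { 0x4, 0xc };

/*
 * Start from the sibling's child domain ('level') and walk down until the
 * candidate group span fits inside parent_span -- the "sibling->child"
 * fallback in the diff above.
 */
static mask_t pick_group_span(mask_t parent_span, const mask_t *domains,
			      int level)
{
	while (level > 0 && !is_subset(domains[level], parent_span))
		level--;
	return domains[level];
}

int main(void)
{
	mask_t parent_span = 0x7;	/* node0's domain2 span = node0+1+2 */

	/* default choice would be node2's domain1 = {2,3}, which leaks node3 */
	mask_t chosen = pick_group_span(parent_span, node2_domains, 1);

	printf("group span used for node2: 0x%x (i.e. node2 alone)\n", chosen);
	return 0;
}

Compiled and run, this prints 0x4, i.e. the group contributed by node2
shrinks to node2 alone, which matches the second diagram above.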

Thanks
Barry
