Message-ID: <13707f648ee547b6afe2285011c58b4f@hisilicon.com>
Date: Mon, 25 Jan 2021 03:13:38 +0000
From: "Song Bao Hua (Barry Song)" <song.bao.hua@...ilicon.com>
To: Valentin Schneider <valentin.schneider@....com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Mel Gorman <mgorman@...e.de>
CC: Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Morten Rasmussen <morten.rasmussen@....com>,
linux-kernel <linux-kernel@...r.kernel.org>,
"linuxarm@...neuler.org" <linuxarm@...neuler.org>
Subject: RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of
NUMA diameter > 2
> >
> >
> > Hi,
> >
> > On 18/01/21 11:25, Song Bao Hua wrote:
> > >> -----Original Message-----
> > >> From: Vincent Guittot [mailto:vincent.guittot@...aro.org]
> > >> Sent: Tuesday, January 19, 2021 12:14 AM
> > >> To: Song Bao Hua (Barry Song) <song.bao.hua@...ilicon.com>
> > >> Cc: Ingo Molnar <mingo@...nel.org>; Peter Zijlstra <peterz@...radead.org>;
> > >> Dietmar Eggemann <dietmar.eggemann@....com>; Morten Rasmussen
> > >> <morten.rasmussen@....com>; Valentin Schneider <valentin.schneider@....com>;
> > >> linux-kernel <linux-kernel@...r.kernel.org>; Mel Gorman <mgorman@...e.de>;
> > >> linuxarm@...neuler.org
> > >> Subject: Re: [RFC PATCH] sched/fair: first try to fix the scheduling impact
> > >> of NUMA diameter > 2
> > >>
> > >> On Fri, 15 Jan 2021 at 21:42, Barry Song <song.bao.hua@...ilicon.com> wrote:
> > >> >
> > >> > This patch is a follow-up of the 3-hops issue reported by Valentin Schneider:
> > >> > [1] https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/
> > >> > [2] https://lore.kernel.org/lkml/20201110184300.15673-1-valentin.schneider@arm.com/
> > >> >
> > >> > Here is a brief summary of the background:
> > >> > For a NUMA system with 3 hops, a sched_group at the 2-hops NUMA level may
> > >> > not be a subset of its sched_domain.
> > >> > For example, for a system with the below topology (two CPUs in each NUMA
> > >> > node):
> > >> > node   0   1   2   3
> > >> >   0:  10  12  20  22
> > >> >   1:  12  10  22  24
> > >> >   2:  20  22  10  12
> > >> >   3:  22  24  12  10
> > >> >
> > >> > For CPU0, domain-2 will span 0-5, but its groups will span 0-3 and 4-7;
> > >> > 4-7 isn't a subset of 0-5.
> > >> >
> > >> > CPU0 attaching sched-domain(s):
> > >> >  domain-0: span=0-1 level=MC
> > >> >   groups: 0:{ span=0 cap=989 }, 1:{ span=1 cap=1016 }
> > >> >   domain-1: span=0-3 level=NUMA
> > >> >    groups: 0:{ span=0-1 cap=2005 }, 2:{ span=2-3 cap=2028 }
> > >> >    domain-2: span=0-5 level=NUMA
> > >> >     groups: 0:{ span=0-3 cap=4033 }, 4:{ span=4-7 cap=3909 }
> > >> > ERROR: groups don't span domain->span
> > >> >     domain-3: span=0-7 level=NUMA
> > >> >      groups: 0:{ span=0-5 mask=0-1 cap=6062 }, 6:{ span=4-7 mask=6-7 cap=3928 }
> > >> >
> > >> > All other CPUs have the same issue: a sched_group may not be a subset
> > >> > of its sched_domain.
> > >> >
> > >> > Here I am trying to figure out the scheduling impact of this issue from
> > >> > two aspects:
> > >> > 1. find busiest cpu in load_balance
> > >> > 2. find idlest cpu in fork/exec/wake balance
> > >>
> > >> It would be better to fix the error in the sched domain topology instead
> > >> of hacking the load balance to compensate for the topology problem.
> > >
> > > I think Valentin Schneider tried to do that before, but it didn't work out.
> > > This will add some new groups which won't be managed by the current
> > > update_group_capacity()?
> > > @Valentin, would you like to share more details?
> > >
> >
> > Sorry for being late to the party, this is gnarly stuff and I can't dive
> > back into it without spending some time staring at my notes and diagrams...
> > I did indeed try to fix the group construction, thinking it would "just" be
> > a matter of changing one mask into another, but it turned out to be quite a
> > bit trickier.
> >
> > Let's go back to https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com/
> >
> > Right now, for that #Case study w/ QEMU platform, we get:
> >
> > CPU0-domain1: span=0-2
> >   group0: span=0-2, mask=0
> >   group2: span=1-3, mask=2  # CPU3 shouldn't be included
> > CPU1-domain1: span=0-3
> >   group1: span=0-2, mask=1
> >   group3: span=2-3, mask=3
> > CPU2-domain1: span=0-3
> >   group2: span=1-3, mask=2
> >   group0: span=0-1, mask=0
> > CPU3-domain1: span=1-3
> >   group3: span=2-3, mask=3
> >   group1: span=0-2, mask=1  # CPU0 shouldn't be included
> >
> > We would want to "fix" this into:
> >
> > CPU0-domain1
> >   group0: span=0-2, mask=0
> > - group2: span=1-3, mask=2
> > + group?: span=1-2, mask=??
> > CPU1-domain1
> >   group1: span=0-2, mask=1
> >   group3: span=2-3, mask=3
> > CPU2-domain1
> >   group2: span=1-3, mask=2
> >   group0: span=0-1, mask=0
> > CPU3-domain1
> >   group3: span=2-3, mask=3
> > - group1: span=0-2, mask=1
> > + group?: span=1-2, mask=??
> >
> > Note the '?' for the group ID and for the group balance mask. What I tried
> > to hint at when writing this is that, right now, there is no sane ID nor
> > balance mask to give to those "new" groups.
> >
> > The group ID is supposed to be the CPU owning the sched_group_capacity
> > structure for the unique group span, which right now is the first CPU in
> > the balance mask - I recommend reading the comments atop
> > group_balance_cpu(), build_balance_mask() and get_group().
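
(For reference, if I recall correctly group_balance_cpu() is essentially
just "the first CPU of the group's balance mask", roughly:

	int group_balance_cpu(struct sched_group *sg)
	{
		return cpumask_first(group_balance_mask(sg));
	}

so whichever CPU is first in that mask ends up owning the group's
sched_group_capacity slot.)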
> >
> > Here, we would end up with 5 unique group spans despite only having 4 CPUs:
> > our allocation scheme doesn't hold up anymore. This stems from the way we
> > allocate our topology data: we have a percpu slot per topology level.
> >
> > Furthermore, these "new" groups won't be the local group of any CPU,
> > which means update_group_capacity() will never visit them - their capacity
> > will never be updated.
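
To make the "5 unique spans vs 4 per-CPU slots" point concrete, here is a
tiny userspace toy (purely illustrative, not kernel code) that just counts
the unique group spans the "fixed" layout above would need at this level:

	#include <stdio.h>

	int main(void)
	{
		/* group spans as 4-bit CPU masks: {0-2} {1-3} {0-1} {2-3} {1-2} */
		unsigned int spans[] = { 0x7, 0xe, 0x3, 0xc, 0x6 };
		unsigned int seen[8];
		int nr_unique = 0, nr_cpus = 4, i, j, dup;

		for (i = 0; i < (int)(sizeof(spans) / sizeof(spans[0])); i++) {
			for (dup = 0, j = 0; j < nr_unique; j++)
				if (seen[j] == spans[i])
					dup = 1;
			if (!dup)
				seen[nr_unique++] = spans[i];
		}

		/* prints: unique group spans: 5, per-CPU sgc slots: 4 */
		printf("unique group spans: %d, per-CPU sgc slots: %d\n",
		       nr_unique, nr_cpus);
		return 0;
	}

One of those five spans has no per-CPU slot left to own its
sched_group_capacity.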
> >
> >
> > Here are some possible ways forward:
> > - Have extra storage in struct sd_data for sched_group_capacity of those
> > new, non-local groups. There might be topologies where you'll need to
> > store more than one such group per CPU in there.
> > - During load balance stats update, update the local group *and* all of
> > those new, non-local ones.
> >
> > Conceptually I think this is what would be required, but it does feel very
> > duct-tape-y...
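
Just to make the first option concrete, I guess struct sd_data would grow
something like the below (sketch only, the extra field name is made up; the
four existing percpu pointers are what is there today):

	struct sd_data {
		struct sched_domain *__percpu		*sd;
		struct sched_domain_shared *__percpu	*sds;
		struct sched_group *__percpu		*sg;
		struct sched_group_capacity *__percpu	*sgc;
		/*
		 * Hypothetical: extra sched_group_capacity slots for the new
		 * non-local groups; some topologies might need more than one
		 * such slot per CPU.
		 */
		struct sched_group_capacity *__percpu	*extra_sgc;
	};

and then the load balance stats update would have to walk those extra slots
as well.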
>
> Yep. The kernel builds the sched_groups of domain[n] from the child domains
> domain[n-1] of the CPUs in the span of domain[n].
> So the new groups you added don't have the same span as any child domain
> domain[n-1]. This kind of group would be quite weird and would need to be
> maintained separately.
As long as the NUMA diameter is > 2, building sched_domains from the siblings'
child domains will definitely create a sched_domain containing a sched_group
which spans outside the sched_domain:
+------+ +------+ +-------+ +------+
| node | 12 |node | 20 | node | 12 |node |
| 0 +---------+1 +--------+ 2 +-------+3 |
+------+ +------+ +-------+ +------+
domain0 node0 node1 node2 node3
domain1 node0+1 node0+1 node2+3 node2+3
+
domain2 node0+1+2 |
group: node0+1 |
group:node2+3 <-------------------+
When node2 is added into node0's domain2, the kernel uses the child domain of
node2's domain2, which is domain1 (node2+3). Node3 is outside the span of
node0+1+2.
Should we instead use the *child* domain of the *child* domain of node2's
domain2 to build the sched_group?
I mean:
+------+ +------+ +-------+ +------+
| node | 12 |node | 20 | node | 12 |node |
| 0 +---------+1 +--------+ 2 +-------+3 |
+------+ +------+ +-------+ +------+
domain0 node0 node1 +- node2 node3
|
domain1 node0+1 node0+1 | node2+3 node2+3
|
domain2 node0+1+2 |
group: node0+1 |
group:node2 <-------------------+
In this way, it seems we wouldn't have to create a new group, as we would just
be reusing an existing group?
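
To make the comparison concrete, here is a toy userspace model (purely
illustrative; the node masks and the "descend one more level" rule are my own
simplification, not the real build_overlap_sched_groups() code) of the two
construction rules for node0's domain2 in the diagrams above:

	#include <stdio.h>

	/* domain1[n] is node n's domain1 span, domain0[n] is the node itself */
	static const unsigned int domain0[4] = { 0x1, 0x2, 0x4, 0x8 };
	static const unsigned int domain1[4] = { 0x3, 0x3, 0xc, 0xc };

	static void print_groups(const char *how, unsigned int dom2,
				 const unsigned int *child)
	{
		unsigned int covered = 0, grp;
		int n;

		printf("%s:\n", how);
		for (n = 0; n < 4; n++) {
			if (!(dom2 & (1u << n)) || (covered & (1u << n)))
				continue;
			/* the group span is taken from node n's (grand)child domain */
			grp = child[n];
			covered |= grp;
			printf("  group 0x%x%s\n", grp,
			       (grp & ~dom2) ? "  <-- spans outside the domain" : "");
		}
	}

	int main(void)
	{
		unsigned int dom2_node0 = 0x7;	/* node0+1+2 */
		unsigned int fixed[4];
		int n;

		/* current rule: every group comes from the sibling's domain1 */
		print_groups("groups from sibling domain1", dom2_node0, domain1);

		/* alternative: when a sibling's domain1 leaks outside the domain,
		 * descend one more level and use its domain0 instead */
		for (n = 0; n < 4; n++)
			fixed[n] = (domain1[n] & ~dom2_node0) ? domain0[n] : domain1[n];
		print_groups("groups with one more level of descent", dom2_node0, fixed);

		return 0;
	}

The first rule prints a group 0xc (node2+3) leaking outside node0+1+2; the
second prints node0+1 and node2 only, i.e. the groups in the second diagram.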
>
> >
> > > On the other hand, another purpose of this RFC is that I also want to dig
> > > into more details about how the 3-hops issue could affect the behavior of
> > > the scheduler. In Valentin's original thread, I think we haven't yet
> > > figured out how the issue really impacts scheduling.
> > >
> >
> > I think the long story short was that since you can end up with groups
> > spanning CPUs farther away than what the domain represents, the load
> > balance stats computation (to figure out which group is the busiest to
> > pull from) can and will be skewed.
>
> Yes. My patch mentioned two places where the load balance stats are skewed.
>
> 1. find_busiest_group() in load_balance()
> Suppose the domain span is 0-3 and one of its groups spans 2-5;
> CPUs 4 and 5 don't belong to the domain 0-3.
>
> While calculating the load of the group, update_sg_lb_stats() will do
> (the sum of the load of cpu2 and cpu3) / (the capacity of the whole group cpu2-5).
>
> So the load of the group 2-5 will be underestimated. My patch changes this
> to:
> (the sum of the load of cpu2 and cpu3) / (the sum of the capacity of cpu2-3)
>
> It actually gets a relatively correct load for cpu2-3, which is the part of
> the group 2-5 inside the domain. If this "half" group has a high load, another
> group still has a chance to pull tasks from cpu2 and cpu3.
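
A quick toy calculation (made-up numbers, purely illustrative, not the actual
patch) of how much that changes the ratio, mirroring the avg_load-style
"load * 1024 / capacity" computation:

	#include <stdio.h>

	int main(void)
	{
		/* cpu2-3 fully loaded, cpu4-5 idle, every CPU has capacity 1024 */
		unsigned long load_cpu23 = 1024 + 1024;	/* load summed over cpu2-3 */
		unsigned long cap_group  = 4 * 1024;	/* capacity of the whole group cpu2-5 */
		unsigned long cap_cpu23  = 2 * 1024;	/* capacity of cpu2-3 only */

		/* prints 512 vs 1024: the whole-group ratio halves the apparent load */
		printf("avg_load over whole group cpu2-5 : %lu\n",
		       load_cpu23 * 1024 / cap_group);
		printf("avg_load over in-domain cpu2-3   : %lu\n",
		       load_cpu23 * 1024 / cap_cpu23);
		return 0;
	}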
>
> 2. find_idlest_group() in select_task_rq_fair()
> This is mainly for placing a new forked/exec-ed task on an idle cpu.
>
> In this path, I found there is no safeguard at all to prevent pushing a
> task outside the domain span. And the load calculation counts all the CPUs
> in the group, including those outside the domain:
> (the sum of the load of cpu2,3,4,5) / (the capacity of the whole group cpu2-5)
>
> What I have done here is change the load stats update in
> update_sg_wakeup_stats() to:
> (the sum of the load of cpu2 and cpu3) / (the sum of the capacity of cpu2-3)
>
> and add a safeguard to prevent pushing tasks to cpu4-5.
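
The safeguard itself is conceptually just a mask operation; a toy illustration
(not the actual patch code):

	#include <stdio.h>

	int main(void)
	{
		unsigned int domain_span = 0x0f;	/* cpu0-3 */
		unsigned int group_span  = 0x3c;	/* cpu2-5 */
		/* only CPUs in both the group and the domain are valid targets */
		unsigned int allowed     = group_span & domain_span;
		int cpu;

		/* prints cpu2 and cpu3 only; cpu4-5 can never be picked */
		for (cpu = 0; cpu < 8; cpu++)
			if (allowed & (1u << cpu))
				printf("cpu%d is a valid target\n", cpu);
		return 0;
	}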
>
> >
> > There are safeguards to prevent pulling from outside the domain span, but
> > that doesn't protect the stats.
>
> I did see that task pulling won't go outside the domain in find_busiest_queue().
> But since the load calculation is wrong, tasks could be pulled in the wrong
> direction.
>
Thanks
Barry