Message-ID: <8cfd37e2617248f4b008f5564eb854a9@hisilicon.com>
Date: Mon, 1 Feb 2021 21:49:17 +0000
From: "Song Bao Hua (Barry Song)" <song.bao.hua@...ilicon.com>
To: Valentin Schneider <valentin.schneider@....com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"mgorman@...e.de" <mgorman@...e.de>,
"mingo@...nel.org" <mingo@...nel.org>,
"peterz@...radead.org" <peterz@...radead.org>,
"dietmar.eggemann@....com" <dietmar.eggemann@....com>,
"morten.rasmussen@....com" <morten.rasmussen@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: "linuxarm@...neuler.org" <linuxarm@...neuler.org>,
"xuwei (O)" <xuwei5@...wei.com>,
"Liguozhu (Kenneth)" <liguozhu@...ilicon.com>,
"tiantao (H)" <tiantao6@...ilicon.com>,
wanghuiqiang <wanghuiqiang@...wei.com>,
"Zengtao (B)" <prime.zeng@...ilicon.com>,
Jonathan Cameron <jonathan.cameron@...wei.com>,
"guodong.xu@...aro.org" <guodong.xu@...aro.org>,
Meelis Roos <mroos@...ux.ee>
Subject: RE: [PATCH] sched/topology: fix the issue groups don't span
domain->span for NUMA diameter > 2
> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@....com]
> Sent: Tuesday, February 2, 2021 7:11 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@...ilicon.com>;
> vincent.guittot@...aro.org; mgorman@...e.de; mingo@...nel.org;
> peterz@...radead.org; dietmar.eggemann@....com; morten.rasmussen@....com;
> linux-kernel@...r.kernel.org
> Cc: linuxarm@...neuler.org; xuwei (O) <xuwei5@...wei.com>; Liguozhu (Kenneth)
> <liguozhu@...ilicon.com>; tiantao (H) <tiantao6@...ilicon.com>; wanghuiqiang
> <wanghuiqiang@...wei.com>; Zengtao (B) <prime.zeng@...ilicon.com>; Jonathan
> Cameron <jonathan.cameron@...wei.com>; guodong.xu@...aro.org; Song Bao Hua
> (Barry Song) <song.bao.hua@...ilicon.com>; Meelis Roos <mroos@...ux.ee>
> Subject: Re: [PATCH] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
>
>
> Hi,
>
> On 01/02/21 16:38, Barry Song wrote:
> > A tricky thing is that we shouldn't use the sgc of the 1st CPU of node2
> > for the sched_group generated by the grandchild domain; otherwise, when
> > this CPU becomes the balance_cpu of another sched_group of CPUs other
> > than node0, our sched_group generated by the grandchild will share the
> > same sgc with the sched_group generated by the child domain of another
> > CPU.
> >
> > So in init_overlap_sched_group(), the sgc's capacity will be overwritten:
> > 	build_balance_mask(sd, sg, mask);
> > 	cpu = cpumask_first_and(sched_group_span(sg), mask);
> >
> > 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
> >
> > And WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)) will
> > also be triggered:
> > static void init_overlap_sched_group(struct sched_domain *sd,
> > 				     struct sched_group *sg)
> > {
> > 	if (atomic_inc_return(&sg->sgc->ref) == 1)
> > 		cpumask_copy(group_balance_mask(sg), mask);
> > 	else
> > 		WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
> > }
> >
> > So here we move to using the sgc of the 2nd CPU. For the corner case
> > where a NUMA node has only one CPU, we will still trigger this
> > WARN_ON_ONCE. But it is really unlikely that a real system would have
> > a NUMA node with only one CPU.
> >
>
> Well, it's trivial to boot this with QEMU, and it's actually the example
> the comment atop that WARN_ONCE() is based on. Also, you could end up with
> a single CPU on a node during hotplug operations...
Hi Valentin,

The QEMU topology is just a reflection of the real Kunpeng 920 case, and
please also note that Meelis has tested on other real hardware as well,
an 8-node Sun Fire X4600-M2, and gave his Tested-by.
It might not be a perfect fix, but it is the simplest fix for the moment
and for the real cases. A "perfect" fix would require major refactoring
of topology.c.
I don't think hotplug is very relevant here: even if some CPUs are
unplugged and only one CPU is left in the sched_group of the
sched_domain, the related domain and group still get the right settings.
On the other hand, the corner case could literally be fixed, but that
would involve some very ugly code. I mean, two sched_groups can end up
using the same sgc:
1. the sched_group generated by the grandchild domain with only one NUMA
   node
2. the sched_group generated by the child domain with more than one NUMA
   node
Right now, I'm moving to the 2nd CPU for sched_group 1. If we also moved
to the 2nd CPU for sched_group 2, then a NUMA node with only one CPU
wouldn't be a problem anymore, but the code would be very ugly.
So I would prefer to keep this assumption and just ignore the
unrealistic corner case.
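
To make "moving to the 2nd CPU" concrete, here is a minimal sketch of
the idea, not the exact patch: the "from_grandchild" flag and how it
would be passed down are hypothetical, and only the sgc-selection part
of init_overlap_sched_group() is shown:

static void init_overlap_sched_group(struct sched_domain *sd,
				     struct sched_group *sg,
				     bool from_grandchild)
{
	struct cpumask *mask = sched_domains_tmpmask2;
	struct sd_data *sdd = sd->private;
	int cpu;

	build_balance_mask(sd, sg, mask);
	cpu = cpumask_first_and(sched_group_span(sg), mask);

	/*
	 * For a group built from a grandchild domain, skip the 1st CPU
	 * and take the sgc of the 2nd one, so it cannot collide with
	 * the sgc picked by a child-level group on the same CPU.
	 */
	if (from_grandchild && cpumask_weight(mask) > 1)
		cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);

	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
	/* ... the rest (balance mask / capacity setup) stays the same */
}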
>
> I am not entirely sure whether having more than one CPU per node is a
> sufficient condition. I'm starting to *think* it is, but I'm not entirely
> convinced yet - and now I need a new notebook.
Me too. Some extremely complicated topology might break the assumption.
I'd really need a new notebook to draw the kind of complicated topology
that would break it :-)
But it is sufficient for the existing real cases which need fixing. If
someday a real case in which each NUMA node has more than one CPU wakes
up the warning below:
	WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
that might be the right time to consider major refactoring of topology.c.
Thanks
Barry