Message-ID: <94c2e3b176e542afa03bea4aa0da7c9c@hisilicon.com>
Date:   Mon, 25 Jan 2021 11:12:04 +0000
From:   "Song Bao Hua (Barry Song)" <song.bao.hua@...ilicon.com>
To:     Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Tim Chen <tim.c.chen@...ux.intel.com>
CC:     "valentin.schneider@....com" <valentin.schneider@....com>,
        "catalin.marinas@....com" <catalin.marinas@....com>,
        "will@...nel.org" <will@...nel.org>,
        "rjw@...ysocki.net" <rjw@...ysocki.net>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "lenb@...nel.org" <lenb@...nel.org>,
        "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
        Jonathan Cameron <jonathan.cameron@...wei.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
        "rostedt@...dmis.org" <rostedt@...dmis.org>,
        "bsegall@...gle.com" <bsegall@...gle.com>,
        "mgorman@...e.de" <mgorman@...e.de>,
        "mark.rutland@....com" <mark.rutland@....com>,
        "sudeep.holla@....com" <sudeep.holla@....com>,
        "aubrey.li@...ux.intel.com" <aubrey.li@...ux.intel.com>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
        "linuxarm@...neuler.org" <linuxarm@...neuler.org>,
        "xuwei (O)" <xuwei5@...wei.com>,
        "Zengtao (B)" <prime.zeng@...ilicon.com>,
        "tiantao (H)" <tiantao6@...ilicon.com>
Subject: RE: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
 add cluster scheduler



> -----Original Message-----
> From: Dietmar Eggemann [mailto:dietmar.eggemann@....com]
> Sent: Wednesday, January 13, 2021 1:53 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@...ilicon.com>; Morten Rasmussen
> <morten.rasmussen@....com>; Tim Chen <tim.c.chen@...ux.intel.com>
> Cc: valentin.schneider@....com; catalin.marinas@....com; will@...nel.org;
> rjw@...ysocki.net; vincent.guittot@...aro.org; lenb@...nel.org;
> gregkh@...uxfoundation.org; Jonathan Cameron <jonathan.cameron@...wei.com>;
> mingo@...hat.com; peterz@...radead.org; juri.lelli@...hat.com;
> rostedt@...dmis.org; bsegall@...gle.com; mgorman@...e.de;
> mark.rutland@....com; sudeep.holla@....com; aubrey.li@...ux.intel.com;
> linux-arm-kernel@...ts.infradead.org; linux-kernel@...r.kernel.org;
> linux-acpi@...r.kernel.org; linuxarm@...neuler.org; xuwei (O)
> <xuwei5@...wei.com>; Zengtao (B) <prime.zeng@...ilicon.com>; tiantao (H)
> <tiantao6@...ilicon.com>
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
> add cluster scheduler
> 
> On 08/01/2021 22:30, Song Bao Hua (Barry Song) wrote:
> >
> >> -----Original Message-----
> >> From: Morten Rasmussen [mailto:morten.rasmussen@....com]
> >> Sent: Saturday, January 9, 2021 4:13 AM
> >> To: Tim Chen <tim.c.chen@...ux.intel.com>
> >> Cc: Song Bao Hua (Barry Song) <song.bao.hua@...ilicon.com>;
> >> valentin.schneider@....com; catalin.marinas@....com; will@...nel.org;
> >> rjw@...ysocki.net; vincent.guittot@...aro.org; lenb@...nel.org;
> >> gregkh@...uxfoundation.org; Jonathan Cameron
> <jonathan.cameron@...wei.com>;
> >> mingo@...hat.com; peterz@...radead.org; juri.lelli@...hat.com;
> >> dietmar.eggemann@....com; rostedt@...dmis.org; bsegall@...gle.com;
> >> mgorman@...e.de; mark.rutland@....com; sudeep.holla@....com;
> >> aubrey.li@...ux.intel.com; linux-arm-kernel@...ts.infradead.org;
> >> linux-kernel@...r.kernel.org; linux-acpi@...r.kernel.org;
> >> linuxarm@...neuler.org; xuwei (O) <xuwei5@...wei.com>; Zengtao (B)
> >> <prime.zeng@...ilicon.com>; tiantao (H) <tiantao6@...ilicon.com>
> >> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters
> and
> >> add cluster scheduler
> >>
> >> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>> On 1/6/21 12:30 AM, Barry Song wrote:
> >>>> ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> >>>> cluster has 4 cpus. All clusters share L3 cache data while each cluster
> >>>> has local L3 tags. In addition, the cpus within each cluster share some
> >>>> internal system bus. This means cache is much more affine inside one
> >>>> cluster than across clusters.
> >>>
> >>> There is a similar need for clustering in x86.  Some x86 cores could
> >>> share L2 caches in a way that is similar to the cluster in Kunpeng 920
> >>> (e.g. on Jacobsville there are 6 clusters of 4 Atom cores, each cluster
> >>> sharing a separate L2, and 24 cores sharing L3).
> >>> Having a sched domain at the L2 cluster helps spread load among
> >>> L2 domains.  This will reduce L2 cache contention and help with
> >>> performance for low to moderate load scenarios.
> >>
> >> IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
> >> between L2 caches while Barry is after consolidating tasks within the
> >> boundaries of a L3 tag cache. One helps cache utilization, the other
> >> communication latency between tasks. Am I missing something?
> >
> > Morten, this is not true.
> >
> > We are both actually looking for the same behavior. My patch does the
> > exact same spreading as Tim's patch.
> 
> That's the case for the load-balance path because of the extra Sched
> Domain (SD) (CLS/MC_L2) below MC.
> 
> But in wakeup you add code which leads to a different packing strategy.

Yes, but I put a note for the 1st case:
"Case 1. we have two tasks *without* any relationship running in a system
with 2 clusters and 8 cpus"

so for tasks without a wake-up relationship, the current patch will only
result in spreading.
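
(For reference, the spreading in case 1 comes purely from describing the
extra level to the scheduler so that load balance also runs between
clusters. Roughly the shape is the below; the cluster mask/flags helpers
are illustrative placeholders for what the series derives from the parsed
ACPI/DT topology, not the exact patch:)

/*
 * Rough sketch only, not the actual patch: an extra topology level
 * below MC describing the cpus that share a cluster (L3 tag and
 * internal bus).  cpu_clustergroup_mask()/cpu_cluster_flags() stand
 * in for whatever the parsed ACPI PPTT / DT information provides.
 */
static struct sched_domain_topology_level sketch_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

Once such a level exists, the regular load-balance code spreads the two
unrelated tasks across the two clusters without any wakeup-path change.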

Anyway, I will also run Tim's benchmark on Kunpeng 920 with SCHED_CLUSTER
to see what happens. Until now, the benchmarks have only covered the case
that shows the benefit of changing the wake-up path.
I would also be interested in figuring out what we gain from the change
to load_balance().
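
Just to make the wake-up side concrete, the idea in the current patch is
roughly the below: try the waker's cluster first and only then fall back to
the usual LLC scan. cluster_mask() is a hypothetical helper here, not the
exact code in the series:

/*
 * Illustrative sketch, not the exact code in the series: before
 * scanning the whole LLC in select_idle_sibling(), try to place the
 * wakee on an idle cpu inside the target's cluster.  cluster_mask()
 * is a made-up helper returning the cpus sharing target's cluster.
 */
static int select_idle_cluster(struct task_struct *p, int target)
{
	int cpu;

	for_each_cpu_wrap(cpu, cluster_mask(target), target) {
		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
			continue;
		if (available_idle_cpu(cpu))
			return cpu;
	}

	return -1;	/* nothing idle here, fall back to the LLC scan */
}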

> 
> It looks like Tim's workload (SPECrate mcf) shows a performance
> boost solely because of the changes the additional MC_L2 SD introduces
> in load balance. The wakeup path is unchanged, i.e. llc-packing. IMHO we
> have to carefully distinguish between packing vs. spreading in wakeup
> and load-balance here.
> 
> > Considering the below two cases:
> > Case 1. we have two tasks without any relationship running in a system with
> 2 clusters and 8 cpus.
> >
> > Without the sched_domain of cluster, these two tasks might be put as below:
> > +-------------------+            +-----------------+
> > | +----+   +----+   |            |                 |
> > | |task|   |task|   |            |                 |
> > | |1   |   |2   |   |            |                 |
> > | +----+   +----+   |            |                 |
> > |                   |            |                 |
> > |       cluster1    |            |     cluster2    |
> > +-------------------+            +-----------------+
> > With the sched_domain of cluster, load balance will spread them as below:
> > +-------------------+            +-----------------+
> > | +----+            |            | +----+          |
> > | |task|            |            | |task|          |
> > | |1   |            |            | |2   |          |
> > | +----+            |            | +----+          |
> > |                   |            |                 |
> > |       cluster1    |            |     cluster2    |
> > +-------------------+            +-----------------+
> >
> > Then task1 and task2 each get more cache and cache contention decreases.
> > They will get better performance.
> >
> > That is what my original patch can also achieve, and Tim's patch does
> > the same. Once we add a sched_domain, load balance will get involved.
> >
> >
> > Case 2. we have 8 tasks, running in a system with 2 clusters and 8 cpus.
> > But they are working in 4 groups:
> > Task1 wakes up task4
> > Task2 wakes up task5
> > Task3 wakes up task6
> > Task7 wakes up task8
> >
> > With my change in select_idle_sibling, the WAKE_AFFINE mechanism will
> > try to put task1 and 4, task2 and 5, task3 and 6, and task7 and 8 in the
> > same clusters rather than putting them on random cpus among the 8. However,
> > the 8 tasks are still spread across the 8 cpus with my change in
> > select_idle_sibling, as load balance is still working.
> >
> > +---------------------------+    +----------------------+
> > | +----+        +-----+     |    | +----+      +-----+  |
> > | |task|        |task |     |    | |task|      |task |  |
> > | |1   |        | 4   |     |    | |2   |      |5    |  |
> > | +----+        +-----+     |    | +----+      +-----+  |
> > |                           |    |                      |
> > |       cluster1            |    |     cluster2         |
> > |                           |    |                      |
> > |                           |    |                      |
> > | +-----+       +------+    |    | +-----+     +------+ |
> > | |task |       | task |    |    | |task |     |task  | |
> > | |3    |       |  6   |    |    | |7    |     |8     | |
> > | +-----+       +------+    |    | +-----+     +------+ |
> > +---------------------------+    +----------------------+
> 
> Your use-case (#tasks, runtime/period) seems to be perfectly crafted to
> show the benefit of your patch on your specific system (cluster-size =
> 4). IMHO, this extra infrastructure especially in the wakeup path should
> show benefits over a range of different benchmarks.
> 
> > Let's consider the 3rd case; that one is more tricky:
> >
> > task1 and task2 have a close relationship and they are a waker-wakee pair.
> > With my current patch, select_idle_sibling() wants to put them in one
> > cluster while load balance wants to put them in two clusters. Load
> > balance will win.
> > Then maybe we need some mechanism similar to the NUMA imbalance
> > adjustment:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/sched/fair.c?id=b396f52326de20
> > If we permit a light imbalance between clusters, select_idle_sibling()
> > will win, and task1 and task2 get better cache affinity.
> 
> This would look weird to allow this kind of imbalance on CLS (MC_L2) and
> NUMA domains but not on the MC domain for example.

Yes. I guess I actually meant permitting imbalance between the sched_groups
made by the child sched_cluster domains of the parent sched_mc domain.

sched_mc domain

+----------------------------------+
|   +--------+     +----------+    |
|   |sched_  |     |sched_    |    |
|   |group   |     |group     |    |
|   +--+-----+     +----+-----+    |
|      |  allow small   |          |
|      |  imbalance     |          |
+----------------------------------+
       |                |
       |                |
       |                |
       |                |
       |                |
       +                +
   child domain:     child domain:
   sched_cluster     sched_cluster

For sched_groups within one sched_cluster domain, we don't allow this
kind of imbalance.

Anyway, I would be happier if this kind of imbalance were only allowed
when we know for certain that two tasks in the cluster have a wake-up
relationship. Right now, SD_NUMA seems to simply allow this imbalance
without any knowledge of the relationships between the tasks causing it.
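
To make the kind of tolerance I mean concrete, below is a standalone model
of the logic (all struct and function names are made up for illustration;
this is not fair.c code): only when the groups are built from cluster child
domains and both still have spare capacity would a skew of a task or two be
ignored, so a waker/wakee pair packed by the wakeup path is not pulled
apart again by load balance.

/*
 * Standalone model only, not kernel code: the SD_NUMA-style "allow a
 * small imbalance" idea restated for groups built from cluster child
 * domains.  All names here are invented for illustration.
 */
struct group_stats {
	unsigned int nr_running;	/* runnable tasks in the group */
	unsigned int group_weight;	/* number of cpus in the group */
};

static long cluster_adjusted_imbalance(long imbalance, int child_is_cluster,
				       const struct group_stats *busiest,
				       const struct group_stats *local)
{
	/* Only relax balancing between groups made by cluster domains. */
	if (!child_is_cluster)
		return imbalance;

	/*
	 * While both clusters still have idle cpus, tolerate a skew of
	 * a task or two so that a pair packed into one cluster by
	 * select_idle_sibling() is not pulled apart again.
	 */
	if (busiest->nr_running < busiest->group_weight &&
	    local->nr_running < local->group_weight &&
	    imbalance <= 2)
		return 0;

	return imbalance;
}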

Thanks
Barry
