Message-ID: <9201b56a29dd4dacb7d9fcbf307ca5ff@hisilicon.com>
Date: Tue, 13 Apr 2021 10:45:44 +0000
From: "Song Bao Hua (Barry Song)" <song.bao.hua@...ilicon.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>,
Morten Rasmussen <morten.rasmussen@....com>,
Tim Chen <tim.c.chen@...ux.intel.com>
CC: "valentin.schneider@....com" <valentin.schneider@....com>,
"catalin.marinas@....com" <catalin.marinas@....com>,
"will@...nel.org" <will@...nel.org>,
"rjw@...ysocki.net" <rjw@...ysocki.net>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"lenb@...nel.org" <lenb@...nel.org>,
"gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
Jonathan Cameron <jonathan.cameron@...wei.com>,
"mingo@...hat.com" <mingo@...hat.com>,
"peterz@...radead.org" <peterz@...radead.org>,
"juri.lelli@...hat.com" <juri.lelli@...hat.com>,
"rostedt@...dmis.org" <rostedt@...dmis.org>,
"bsegall@...gle.com" <bsegall@...gle.com>,
"mgorman@...e.de" <mgorman@...e.de>,
"mark.rutland@....com" <mark.rutland@....com>,
"sudeep.holla@....com" <sudeep.holla@....com>,
"aubrey.li@...ux.intel.com" <aubrey.li@...ux.intel.com>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"linuxarm@...neuler.org" <linuxarm@...neuler.org>,
"xuwei (O)" <xuwei5@...wei.com>,
"Zengtao (B)" <prime.zeng@...ilicon.com>,
"tiantao (H)" <tiantao6@...ilicon.com>,
"Guodong Xu" <guodong.xu@...aro.org>,
yangyicong <yangyicong@...wei.com>
Subject: RE: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
add cluster scheduler
> -----Original Message-----
> From: Dietmar Eggemann [mailto:dietmar.eggemann@....com]
> Sent: Wednesday, January 13, 2021 12:00 AM
> To: Morten Rasmussen <morten.rasmussen@....com>; Tim Chen
> <tim.c.chen@...ux.intel.com>
> Cc: Song Bao Hua (Barry Song) <song.bao.hua@...ilicon.com>;
> valentin.schneider@....com; catalin.marinas@....com; will@...nel.org;
> rjw@...ysocki.net; vincent.guittot@...aro.org; lenb@...nel.org;
> gregkh@...uxfoundation.org; Jonathan Cameron <jonathan.cameron@...wei.com>;
> mingo@...hat.com; peterz@...radead.org; juri.lelli@...hat.com;
> rostedt@...dmis.org; bsegall@...gle.com; mgorman@...e.de;
> mark.rutland@....com; sudeep.holla@....com; aubrey.li@...ux.intel.com;
> linux-arm-kernel@...ts.infradead.org; linux-kernel@...r.kernel.org;
> linux-acpi@...r.kernel.org; linuxarm@...neuler.org; xuwei (O)
> <xuwei5@...wei.com>; Zengtao (B) <prime.zeng@...ilicon.com>; tiantao (H)
> <tiantao6@...ilicon.com>
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
> add cluster scheduler
>
> On 11/01/2021 10:28, Morten Rasmussen wrote:
> > On Fri, Jan 08, 2021 at 12:22:41PM -0800, Tim Chen wrote:
> >>
> >>
> >> On 1/8/21 7:12 AM, Morten Rasmussen wrote:
> >>> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>>> On 1/6/21 12:30 AM, Barry Song wrote:
>
> [...]
>
> >> I think it is going to depend on the workload. If there are dependent
> >> tasks that communicate with one another, putting them together
> >> in the same cluster will be the right thing to do to reduce communication
> >> costs. On the other hand, if the tasks are independent, putting them together
> >> on the same cluster will increase resource contention and spreading them out
> >> will be better.
> >
> > Agree. That is exactly where I'm coming from. This is all about the task
> > placement policy. We generally tend to spread tasks to avoid resource
> > contention, SMT and caches, which seems to be what you are proposing to
> > extend. I think that makes sense given it can produce significant
> > benefits.
> >
> >>
> >> Any thoughts on what is the right clustering "tag" to use to clump
> >> related tasks together?
> >> Cgroup? Pid? Tasks with same mm?
> >
> > I think this is the real question. I think the closest thing we have at
> > the moment is the wakee/waker flip heuristic. This seems to be related.
> > Perhaps the wake_affine tricks can serve as starting point?
>
> wake_wide() switches between packing (select_idle_sibling(), llc_size
> CPUs) and spreading (find_idlest_cpu(), all CPUs).
>
> AFAICS, since none of the sched domains set SD_BALANCE_WAKE, currently
> all wakeups are (llc-)packed.
>
> select_task_rq_fair()
>
>     for_each_domain(cpu, tmp)
>
>         if (tmp->flags & sd_flag)
>             sd = tmp;
>
>
> In case we would like to further distinguish between llc-packing and
> even narrower (cluster or MC-L2) packing, we would introduce a 2nd-level
> packing vs. spreading heuristic further down in sis().
>
> IMHO, Barry's current implementation doesn't do this right now. Instead
> he's trying to pack on cluster first and if not successful look further
> among the remaining llc CPUs for an idle CPU.

Right now, in the main cases where wake_affine is used to achieve better
performance, processes are actually bound within one NUMA node, which is
also an LLC on Kunpeng 920. LLC == NUMA is probably also true for x86
Jacobsville, Tim?
So one possible way to approximate 2-level packing might be: if the affinity
cpusets of the waker and the wakee are both subsets of the same LLC, use the
cluster alone as the factor for deciding whether to pack, and ignore the LLC.

I haven't actually implemented that check; instead, the diff below produces
an equivalent result by forcing llc_id = cluster_id:
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index d72eb8d..3d78097 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -107,7 +107,7 @@ int __init parse_acpi_topology(void)
                 cpu_topology[cpu].cluster_id = topology_id;
                 topology_id = find_acpi_cpu_topology_package(cpu);
                 cpu_topology[cpu].package_id = topology_id;
-
+#if 0
                 i = acpi_find_last_cache_level(cpu);
 
                 if (i > 0) {
@@ -119,8 +119,11 @@ int __init parse_acpi_topology(void)
                         if (cache_id > 0)
                                 cpu_topology[cpu].llc_id = cache_id;
                 }
-        }
+#else
+                cpu_topology[cpu].llc_id = cpu_topology[cpu].cluster_id;
+#endif
+        }
 
         return 0;
 }
 #endif

With this, I have seen some major improvements in hackbench, especially for
the one-to-one communication model (fds_num=1, one sender for one receiver):
numactl -N 0 hackbench -p -T -l 200000 -f 1 -g $1
I have tested -g (group_nums) = 6, 12, 18, 24, 28 and 32. For each g, I ran
the benchmark 20 times and took the average value. The results are as below:
g=      6       12      18      24      28      32
w/o     1.3243  1.6741  1.7560  1.9036  2.0262  2.1826
w/      1.1314  1.1864  1.4494  1.6159  1.9078  2.1249
Using top -H and hitting "f" to show the CPU of each thread, I can see that
the two threads in one group tend to run in the same cluster. That is why the
hackbench latency decreases so much.

Thanks
Barry