lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4d1f063504b1420c9f836d1f1a7f8e77@hisilicon.com>
Date:   Mon, 3 May 2021 11:35:18 +0000
From:   "Song Bao Hua (Barry Song)" <song.bao.hua@...ilicon.com>
To:     Dietmar Eggemann <dietmar.eggemann@....com>,
        Vincent Guittot <vincent.guittot@...aro.org>
CC:     "tim.c.chen@...ux.intel.com" <tim.c.chen@...ux.intel.com>,
        "catalin.marinas@....com" <catalin.marinas@....com>,
        "will@...nel.org" <will@...nel.org>,
        "rjw@...ysocki.net" <rjw@...ysocki.net>,
        "bp@...en8.de" <bp@...en8.de>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "lenb@...nel.org" <lenb@...nel.org>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "rostedt@...dmis.org" <rostedt@...dmis.org>,
        "bsegall@...gle.com" <bsegall@...gle.com>,
        "mgorman@...e.de" <mgorman@...e.de>,
        "msys.mizuma@...il.com" <msys.mizuma@...il.com>,
        "valentin.schneider@....com" <valentin.schneider@....com>,
        "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
        Jonathan Cameron <jonathan.cameron@...wei.com>,
        "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
        "mark.rutland@....com" <mark.rutland@....com>,
        "sudeep.holla@....com" <sudeep.holla@....com>,
        "aubrey.li@...ux.intel.com" <aubrey.li@...ux.intel.com>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>, "xuwei (O)" <xuwei5@...wei.com>,
        "Zengtao (B)" <prime.zeng@...ilicon.com>,
        "guodong.xu@...aro.org" <guodong.xu@...aro.org>,
        yangyicong <yangyicong@...wei.com>,
        "Liguozhu (Kenneth)" <liguozhu@...ilicon.com>,
        "linuxarm@...neuler.org" <linuxarm@...neuler.org>,
        "hpa@...or.com" <hpa@...or.com>
Subject: RE: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks
 within one LLC



> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Monday, May 3, 2021 6:12 PM
> To: 'Dietmar Eggemann' <dietmar.eggemann@....com>; Vincent Guittot
> <vincent.guittot@...aro.org>
> Cc: tim.c.chen@...ux.intel.com; catalin.marinas@....com; will@...nel.org;
> rjw@...ysocki.net; bp@...en8.de; tglx@...utronix.de; mingo@...hat.com;
> lenb@...nel.org; peterz@...radead.org; rostedt@...dmis.org;
> bsegall@...gle.com; mgorman@...e.de; msys.mizuma@...il.com;
> valentin.schneider@....com; gregkh@...uxfoundation.org; Jonathan Cameron
> <jonathan.cameron@...wei.com>; juri.lelli@...hat.com; mark.rutland@....com;
> sudeep.holla@....com; aubrey.li@...ux.intel.com;
> linux-arm-kernel@...ts.infradead.org; linux-kernel@...r.kernel.org;
> linux-acpi@...r.kernel.org; x86@...nel.org; xuwei (O) <xuwei5@...wei.com>;
> Zengtao (B) <prime.zeng@...ilicon.com>; guodong.xu@...aro.org; yangyicong
> <yangyicong@...wei.com>; Liguozhu (Kenneth) <liguozhu@...ilicon.com>;
> linuxarm@...neuler.org; hpa@...or.com
> Subject: RE: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks
> within one LLC
> 
> 
> 
> > -----Original Message-----
> > From: Dietmar Eggemann [mailto:dietmar.eggemann@....com]
> > Sent: Friday, April 30, 2021 10:43 PM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@...ilicon.com>; Vincent Guittot
> > <vincent.guittot@...aro.org>
> > Cc: tim.c.chen@...ux.intel.com; catalin.marinas@....com; will@...nel.org;
> > rjw@...ysocki.net; bp@...en8.de; tglx@...utronix.de; mingo@...hat.com;
> > lenb@...nel.org; peterz@...radead.org; rostedt@...dmis.org;
> > bsegall@...gle.com; mgorman@...e.de; msys.mizuma@...il.com;
> > valentin.schneider@....com; gregkh@...uxfoundation.org; Jonathan Cameron
> > <jonathan.cameron@...wei.com>; juri.lelli@...hat.com;
> mark.rutland@....com;
> > sudeep.holla@....com; aubrey.li@...ux.intel.com;
> > linux-arm-kernel@...ts.infradead.org; linux-kernel@...r.kernel.org;
> > linux-acpi@...r.kernel.org; x86@...nel.org; xuwei (O) <xuwei5@...wei.com>;
> > Zengtao (B) <prime.zeng@...ilicon.com>; guodong.xu@...aro.org; yangyicong
> > <yangyicong@...wei.com>; Liguozhu (Kenneth) <liguozhu@...ilicon.com>;
> > linuxarm@...neuler.org; hpa@...or.com
> > Subject: Re: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks
> > within one LLC
> >
> > On 29/04/2021 00:41, Song Bao Hua (Barry Song) wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Dietmar Eggemann [mailto:dietmar.eggemann@....com]
> >
> > [...]
> >
> > >>>>> From: Dietmar Eggemann [mailto:dietmar.eggemann@....com]
> > >>
> > >> [...]
> > >>
> > >>>>> On 20/04/2021 02:18, Barry Song wrote:
> >
> > [...]
> >
> > > Though we will never go to slow path, wake_wide() will affect want_affine,
> > > so eventually affect the "new_cpu"?
> >
> > yes.
> >
> > >
> > > 	for_each_domain(cpu, tmp) {
> > > 		/*
> > > 		 * If both 'cpu' and 'prev_cpu' are part of this domain,
> > > 		 * cpu is a valid SD_WAKE_AFFINE target.
> > > 		 */
> > > 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > > 		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > > 			if (cpu != prev_cpu)
> > > 				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
> > >
> > > 			sd = NULL; /* Prefer wake_affine over balance flags */
> > > 			break;
> > > 		}
> > >
> > > 		if (tmp->flags & sd_flag)
> > > 			sd = tmp;
> > > 		else if (!want_affine)
> > > 			break;
> > > 	}
> > >
> > > If wake_affine is false, the above won't execute, new_cpu(target) will
> > > always be "prev_cpu"? so when task size > cluster size in wake_wide(),
> > > this means we won't pull the wakee to the cluster of waker? It seems
> > > sensible.
> >
> > What is `task size` here?
> >
> > The criterion is `!(slave < factor || master < slave * factor)` or
> > `slave >= factor && master >= slave * factor` to wake wide.
> >
> 
> Yes. For "task size", I actually mean a bundle of waker-wakee tasks
> which can make "slave >= factor && master >= slave * factor" either
> true or false, then change the target cpu where we are going to scan
> from.
> Now since I have moved to cluster level when tasks have been in same
> LLC level, it seems it would be more sensible to use "cluster_size" as
> factor?
> 
> > I see that since you effectively change the sched domain size from LLC
> > to CLUSTER (e.g. 24->6) for wakeups with cpu and prev_cpu sharing LLC
> > (hence the `numactl -N 0` in your workload), wake_wide() has to take
> > CLUSTER size into consideration.
> >
> > I was wondering if you saw wake_wide() returning 1 with your use cases:
> >
> > numactl -N 0 /usr/lib/lmbench/bin/stream -P [6,12] -M 1024M -N 5
> 
> I couldn't make wake_wide return 1 by the above stream command.
> And I can't reproduce it by a 1:1(monogamous) hackbench "-f 1".
> 
> But I am able to reproduce this issue by a M:N hackbench, for example:
> 
> numactl -N 0 hackbench -p -T -f 10 -l 20000 -g 1
> 
> hackbench will create 10 senders which will send messages to 10
> receivers. (Each sender can send messages to all 10 receivers.)
> 
> I've often seen flips like:
> waker wakee
> 1501  39
> 1509  17
> 11   1320
> 13   2016
> 
> 11, 13, 17 is smaller than LLC but larger than cluster. So the wake_wide()
> using cluster factor will return 1, on the other hand, if we always use
> llc_size as factor, it will return 0.
> 
> However, it seems the change in wake_wide() could bring some negative
> influence to M:N relationship(-f 10) according to tests made today by:
> 
> numactl -N 0 hackbench -p -T -f 10 -l 20000 -g $1
> 
> g             =      1     2       3       4
> cluster_size     0.5768 0.6578  0.8117 1.0119
> LLC_size         0.5479 0.6162  0.6922 0.7754
> 
> Always using llc_size as factor in wake_wide still shows better result
> in the 10:10 polygamous hackbench.
> 
> So it seems the `slave >= factor && master >= slave * factor` isn't
> a suitable criterion for cluster size?

On the other hand, according to "sched: Implement smarter wake-affine logic"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=62470419

Proper factor in wake_wide is mainly beneficial of 1:n tasks like postgresql/pgbench.
So using the smaller cluster size as factor might help make wake_affine false so
improve pgbench.

From the commit log, while clients =  2*cpus, the commit made the biggest
improvement. In my case, It should be clients=48 for a machine whose LLC
size is 24.

In Linux, I created a 240MB database and ran "pgbench -c 48 -S -T 20 pgbench"
under two different scenarios:
1. page cache always hit, so no real I/O for database read
2. echo 3 > /proc/sys/vm/drop_caches

For case 1, using cluster_size and using llc_size will result in similar
tps= ~108000, all of 24 cpus have 100% cpu utilization.

For case 2, using llc_size still shows better performance.

tps for each test round(cluster size as factor in wake_wide):
1398.450887 1275.020401 1632.542437 1412.241627 1611.095692 1381.354294 1539.877146
avg tps = 1464

tps for each test round(llc size as factor in wake_wide):
1718.402983 1443.169823 1502.353823 1607.415861 1597.396924 1745.651814 1876.802168
avg tps = 1641  (+12%)

so it seems using cluster_size as factor in "slave >= factor && master >= slave *
factor" isn't a good choice for my machine at least.

Thanks
Barry

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ