Message-ID: <VI1PR0501MB2110A7771BED1C46F0E44DA6B28B0@VI1PR0501MB2110.eurprd05.prod.outlook.com>
Date: Wed, 9 Aug 2017 15:19:02 +0000
From: "Ofer Levi(SW)" <oferle@...lanox.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: "rusty@...tcorp.com.au" <rusty@...tcorp.com.au>,
"mingo@...hat.com" <mingo@...hat.com>,
"Vineet.Gupta1@...opsys.com" <Vineet.Gupta1@...opsys.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Tejun Heo <tj@...nel.org>
Subject: RE: hotplug support for arch/arc/plat-eznps platform
I appreciate your effort and detailed reply, however I'm still experiencing a performance hit in
partition_sched_domains(). It seems the issue is due to the large number of cpus.
I used the suggested method 2: patched in the diffs and used the isolcpus command-line switch to
kill load-balancing.
That did save a few hundredths of a second per cpu. When I limited the number of available cpus
(via the present and possible cpu masks) to 48, this function's execution time dropped dramatically:
With 4K available cpus :
[ 48.890000] ## CPU16 LIVE ##: Executing Code...
[ 48.910000] partition_sched_domains start
[ 49.360000] partition_sched_domains end
With 48 available cpus:
[ 36.950000] ## CPU16 LIVE ##: Executing Code...
[ 36.950000] partition_sched_domains start
[ 36.960000] partition_sched_domains end
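(For reference, a minimal sketch of how the present/possible masks can be capped for such an
experiment from platform SMP init code; the hook and the value 48 are assumptions for
illustration, not the actual change I used:)

#include <linux/cpumask.h>
#include <linux/init.h>

#define TEST_MAX_CPUS	48	/* cap used for the 48-cpu measurement above */

static void __init limit_available_cpus(void)
{
	unsigned int cpu;

	/* Clear every cpu above the cap from the possible and present
	 * masks before SMP bring-up looks at them. */
	for (cpu = TEST_MAX_CPUS; cpu < nr_cpu_ids; cpu++) {
		set_cpu_possible(cpu, false);
		set_cpu_present(cpu, false);
	}
}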
Note that I currently use kernel version 4.8.0.17.0600.00.0000, in case this has any influence.
Would appreciate your thoughts.
Thanks
-Ofer
> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@...radead.org]
> Sent: Tuesday, August 8, 2017 1:16 PM
> To: Ofer Levi(SW) <oferle@...lanox.com>
> Cc: rusty@...tcorp.com.au; mingo@...hat.com;
> Vineet.Gupta1@...opsys.com; linux-kernel@...r.kernel.org; Tejun Heo
> <tj@...nel.org>
> Subject: Re: hotplug support for arch/arc/plat-eznps platform
>
> On Tue, Aug 08, 2017 at 06:49:39AM +0000, Ofer Levi(SW) wrote:
>
> > The idea behind implementing hotplug for this arch is to shorten the time
> > to traffic processing. This way, instead of waiting ~5 min for all
> > cpus to boot, the application running on cpu 0 will loop, booting the other
> > cpus and assigning the traffic processing application to each of them.
> > Outgoing traffic will build up until all cpus are up and running at the
> > full traffic rate. This method allows traffic processing to start after
> > ~20 sec instead of 5 min.
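> > (A minimal user-space sketch of the onlining loop, assuming the standard
> > sysfs hotplug interface; handing each new cpu to a processing thread is
> > left to the application:)
> >
> > 	#include <stdio.h>
> >
> > 	int main(void)
> > 	{
> > 		char path[64];
> > 		int cpu;
> >
> > 		for (cpu = 1; cpu < 4096; cpu++) {
> > 			FILE *f;
> >
> > 			/* Bring the cpu up via sysfs. */
> > 			snprintf(path, sizeof(path),
> > 				 "/sys/devices/system/cpu/cpu%d/online", cpu);
> > 			f = fopen(path, "w");
> > 			if (!f)
> > 				continue;
> > 			fputs("1", f);
> > 			fclose(f);
> >
> > 			/* Start/assign a traffic processing worker for
> > 			 * this cpu here (application specific). */
> > 		}
> > 		return 0;
> > 	}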
>
> Ah, ok. So only online is ever used. Offline is a whole other can of worms.
>
> > > So how can boot be different than hot-plugging them?
> >
> > Please have a look at the following code in kernel/sched/core.c,
> > sched_cpu_activate():
> >
> > 	if (sched_smp_initialized) {
> > 		sched_domains_numa_masks_set(cpu);
> > 		cpuset_cpu_active();
> > 	}
>
> Ah, cute, I totally missed we did that. Yes that avoids endless domain rebuilds
> on boot.
>
> > The cpuset_cpu_active() call eventually leads to the function in
> > question, partition_sched_domains(). When cold-booting cpus, the
> > sched_smp_initialized flag is false and therefore
> > partition_sched_domains() is not executed.
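> > (Paraphrased sketch of that path for reference, not the exact source:)
> >
> > 	/*
> > 	 * Once sched_smp_initialized is set, each cpu online does:
> > 	 *
> > 	 *   sched_cpu_activate(cpu)
> > 	 *     cpuset_cpu_active()
> > 	 *       cpuset_update_active_cpus()
> > 	 *         partition_sched_domains(1, NULL, NULL)  // synchronous rebuild
> > 	 *         schedule_work(&cpuset_hotplug_work)     // deferred rebuild
> > 	 */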
>
> So you're booting with "maxcpus=1" to only online the one. And then you
> want to online the rest once userspace runs.
>
> There's two possibilities. The one I prefer (but which appears the most
> broken with the current code) is using the cpuset controller.
>
> 1)
>
> Once you're up and running with a single CPU do:
>
> $ mkdir /cgroup
> $ mount none /cgroup -t cgroup -o cpuset
> $ echo 0 > /cgroup/cpuset.sched_load_balance
> $ for ((i=1;i<4096;i++))
>   do
>     echo 1 > /sys/devices/system/cpu/cpu$i/online;
>   done
>
> And then, if you want load-balancing, you can re-enable it globally,
> or only on a subset of CPUs.
>
>
> 2)
>
> The alternative is to use "isolcpus=1-4095" to completely kill
> load-balancing. This more or less works with the current code,
> except that it will keep rebuilding the CPU0 sched-domain, which
> is somewhat pointless (also fixed by the below patch).
>
> The reason I don't particularly like this option is that it's boot-time
> only; you cannot reconfigure your system at runtime, but that might
> be good enough for you.
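> (Since isolated cpus get no load balancing at all, the application has to
> place its threads on them explicitly; a minimal user-space sketch, assuming
> one worker per isolated cpu:)
>
> 	#define _GNU_SOURCE
> 	#include <sched.h>
>
> 	/* Pin the calling thread to @cpu; with isolcpus nothing will
> 	 * ever migrate it there (or away) for us. */
> 	static int pin_self_to_cpu(int cpu)
> 	{
> 		cpu_set_t set;
>
> 		CPU_ZERO(&set);
> 		CPU_SET(cpu, &set);
> 		return sched_setaffinity(0, sizeof(set), &set);
> 	}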
>
>
> With the attached patch, either option generates (I only have 40 CPUs):
>
> [ 44.305563] CPU0 attaching NULL sched-domain.
> [ 51.954872] SMP alternatives: switching to SMP code
> [ 51.976923] x86: Booting SMP configuration:
> [ 51.981602] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 52.057756] microcode: sig=0x306e4, pf=0x1, revision=0x416
> [ 52.064740] microcode: updated to revision 0x428, date = 2014-05-29
> [ 52.080854] smpboot: Booting Node 0 Processor 2 APIC 0x4
> [ 52.164124] smpboot: Booting Node 0 Processor 3 APIC 0x6
> [ 52.244615] smpboot: Booting Node 0 Processor 4 APIC 0x8
> [ 52.324564] smpboot: Booting Node 0 Processor 5 APIC 0x10
> [ 52.405407] smpboot: Booting Node 0 Processor 6 APIC 0x12
> [ 52.485460] smpboot: Booting Node 0 Processor 7 APIC 0x14
> [ 52.565333] smpboot: Booting Node 0 Processor 8 APIC 0x16
> [ 52.645364] smpboot: Booting Node 0 Processor 9 APIC 0x18
> [ 52.725314] smpboot: Booting Node 1 Processor 10 APIC 0x20
> [ 52.827517] smpboot: Booting Node 1 Processor 11 APIC 0x22
> [ 52.912271] smpboot: Booting Node 1 Processor 12 APIC 0x24
> [ 52.996101] smpboot: Booting Node 1 Processor 13 APIC 0x26
> [ 53.081239] smpboot: Booting Node 1 Processor 14 APIC 0x28
> [ 53.164990] smpboot: Booting Node 1 Processor 15 APIC 0x30
> [ 53.250146] smpboot: Booting Node 1 Processor 16 APIC 0x32
> [ 53.333894] smpboot: Booting Node 1 Processor 17 APIC 0x34
> [ 53.419026] smpboot: Booting Node 1 Processor 18 APIC 0x36
> [ 53.502820] smpboot: Booting Node 1 Processor 19 APIC 0x38
> [ 53.587938] smpboot: Booting Node 0 Processor 20 APIC 0x1
> [ 53.659828] microcode: sig=0x306e4, pf=0x1, revision=0x428
> [ 53.674857] smpboot: Booting Node 0 Processor 21 APIC 0x3
> [ 53.756346] smpboot: Booting Node 0 Processor 22 APIC 0x5
> [ 53.836793] smpboot: Booting Node 0 Processor 23 APIC 0x7
> [ 53.917753] smpboot: Booting Node 0 Processor 24 APIC 0x9
> [ 53.998717] smpboot: Booting Node 0 Processor 25 APIC 0x11
> [ 54.079674] smpboot: Booting Node 0 Processor 26 APIC 0x13
> [ 54.160636] smpboot: Booting Node 0 Processor 27 APIC 0x15
> [ 54.241592] smpboot: Booting Node 0 Processor 28 APIC 0x17
> [ 54.322553] smpboot: Booting Node 0 Processor 29 APIC 0x19
> [ 54.403487] smpboot: Booting Node 1 Processor 30 APIC 0x21
> [ 54.487676] smpboot: Booting Node 1 Processor 31 APIC 0x23
> [ 54.571921] smpboot: Booting Node 1 Processor 32 APIC 0x25
> [ 54.656508] smpboot: Booting Node 1 Processor 33 APIC 0x27
> [ 54.740835] smpboot: Booting Node 1 Processor 34 APIC 0x29
> [ 54.824466] smpboot: Booting Node 1 Processor 35 APIC 0x31
> [ 54.908374] smpboot: Booting Node 1 Processor 36 APIC 0x33
> [ 54.992322] smpboot: Booting Node 1 Processor 37 APIC 0x35
> [ 55.076333] smpboot: Booting Node 1 Processor 38 APIC 0x37
> [ 55.160249] smpboot: Booting Node 1 Processor 39 APIC 0x39
>
>
> ---
> Subject: sched,cpuset: Avoid spurious/wrong domain rebuilds
>
> When disabling cpuset.sched_load_balance we expect to be able to online
> CPUs without generating sched_domains. However this is currently
> completely broken.
>
> What happens is that we generate the sched_domains and then destroy
> them. This is because of the spurious 'default' domain build in
> cpuset_update_active_cpus(). That builds a single machine-wide domain and
> then schedules a work item to build the 'real' domains. The work then
> finds there are _no_ domains and destroys the lot again.
>
> Furthermore, if there actually were cpusets, building the machine-wide
> domain is actively wrong, because it would allow tasks to 'escape' their
> cpuset. Also I don't think it's needed; the scheduler really should respect
> the active mask.
>
> Also (this should probably be a separate patch) fix
> partition_sched_domains() to try and preserve the existing machine-wide
> domain instead of unconditionally destroying it. We do this by attempting
> to allocate the new single domain; only when that fails do we reuse the
> fallback_doms.
>
> Cc: Tejun Heo <tj@...nel.org>
> Almost-Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> ---
>  kernel/cgroup/cpuset.c  |  6 ------
>  kernel/sched/topology.c | 15 ++++++++++++---
>  2 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ca8376e5008c..e557cdba2350 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2342,13 +2342,7 @@ void cpuset_update_active_cpus(void)
>  	 * We're inside cpu hotplug critical region which usually nests
>  	 * inside cgroup synchronization. Bounce actual hotplug processing
>  	 * to a work item to avoid reverse locking order.
> -	 *
> -	 * We still need to do partition_sched_domains() synchronously;
> -	 * otherwise, the scheduler will get confused and put tasks to the
> -	 * dead CPU. Fall back to the default single domain.
> -	 * cpuset_hotplug_workfn() will rebuild it as necessary.
>  	 */
> -	partition_sched_domains(1, NULL, NULL);
>  	schedule_work(&cpuset_hotplug_work);
>  }
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895aec281e..1b74b2cc5dba 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1854,7 +1854,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>  	/* Let the architecture update CPU core mappings: */
>  	new_topology = arch_update_cpu_topology();
> 
> -	n = doms_new ? ndoms_new : 0;
> +	if (!doms_new) {
> +		WARN_ON_ONCE(dattr_new);
> +		n = 0;
> +		doms_new = alloc_sched_domains(1);
> +		if (doms_new) {
> +			n = 1;
> +			cpumask_andnot(doms_new[0], cpu_active_mask,
> +				       cpu_isolated_map);
> +		}
> +	} else {
> +		n = ndoms_new;
> +	}
> 
>  	/* Destroy deleted domains: */
>  	for (i = 0; i < ndoms_cur; i++) {
> @@ -1870,11 +1880,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>  	}
> 
>  	n = ndoms_cur;
> -	if (doms_new == NULL) {
> +	if (!doms_new) {
>  		n = 0;
>  		doms_new = &fallback_doms;
>  		cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
> -		WARN_ON_ONCE(dattr_new);
>  	}
> 
>  	/* Build new domains: */