Date:	Thu, 26 Mar 2015 13:28:33 +0530
From:	Preeti U Murthy <preeti@...ux.vnet.ibm.com>
To:	peterz@...radead.org, mingo@...nel.org
Cc:	riel@...hat.com, daniel.lezcano@...aro.org,
	vincent.guittot@...aro.org, srikar@...ux.vnet.ibm.com,
	pjt@...gle.com, benh@...nel.crashing.org, efault@....de,
	linux-kernel@...r.kernel.org, iamjoonsoo.kim@....com,
	svaidy@...ux.vnet.ibm.com, tim.c.chen@...ux.intel.com,
	morten.rasmussen@....com, jason.low2@...com
Subject: [PATCH] sched: Improve load balancing in the presence of idle CPUs

When a CPU is kicked to do nohz idle balancing, it wakes up to do load
balancing on itself, followed by load balancing on behalf of idle CPUs. But it
may end up with load after the load balancing attempt on itself. This aborts
nohz idle balancing. As a result, several idle CPUs are left without tasks
until an ILB CPU finds it unfavorable to pull tasks onto itself. This delays
the spreading of load across idle CPUs and, worse, clutters only a few CPUs
with tasks.
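
For context, the pre-patch flow in kernel/sched/fair.c looks roughly like the
sketch below (abridged, not verbatim); the abort described above happens at
the need_resched() check inside nohz_idle_balance():

	static void run_rebalance_domains(struct softirq_action *h)
	{
		struct rq *this_rq = this_rq();
		enum cpu_idle_type idle = this_rq->idle_balance ?
							CPU_IDLE : CPU_NOT_IDLE;

		/* Balance this CPU first; it may pull tasks onto itself. */
		rebalance_domains(this_rq, idle);
		/* Then balance on behalf of the tickless idle CPUs. */
		nohz_idle_balance(this_rq, idle);
	}

	static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
	{
		...
		for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
			...
			/*
			 * If this CPU got work to do (e.g. it pulled tasks
			 * in the earlier rebalance_domains() call), stop
			 * balancing on behalf of the other idle CPUs.
			 */
			if (need_resched())
				break;
			...
		}
		...
	}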

The effect of the above problem was observed on an SMT8 POWER server with 2
levels of NUMA domains. Busy loops equal to the number of cores were spawned.
Since load balancing on fork/exec is discouraged across NUMA domains, all busy
loops would start on one of the NUMA domains. However, it was expected that
eventually one busy loop would run per core across all domains due to nohz
idle load balancing. But it was observed that it took as long as 10 seconds
to spread the load across the NUMA domains.
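
The mail does not include the test program itself; a minimal stand-in
reproducer (hypothetical, not the original experiment's code) is a plain
fork()ed busy loop per core:

	/* busyloops.c: spawn N spinning children, N = number of cores. */
	#include <stdlib.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		int n = argc > 1 ? atoi(argv[1]) : 1;
		int i;

		for (i = 0; i < n; i++) {
			if (fork() == 0)
				for (;;)	/* burn CPU forever */
					;
		}
		pause();	/* parent just waits */
		return 0;
	}

Watching per-CPU utilization (e.g. in top) then shows how long the spinners
take to spread across the NUMA domains.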

Further investigation showed that this was a consequence of the following:

1. An ILB CPU was chosen from the first NUMA domain to trigger nohz idle load
balancing (see the find_new_ilb() sketch below this list). [Given the
experiment, up to 6 CPUs per core could potentially be idle in this domain.]

2. However, the ILB CPU would call load_balance() on itself before initiating
nohz idle load balancing.

3. Given that the cores are SMT8, the ILB CPU had enough opportunities to pull
tasks from its sibling cores to even out the load.

4. Now that the ILB CPU was no longer idle, it would abort nohz idle load
balancing.

As a result, the opportunities to spread load across the NUMA domains were
lost until the cores within the first NUMA domain had an equal number of tasks
among themselves. This is a pretty bad scenario, since the cores within the
first NUMA domain would have as many as 4 tasks each, while the cores in the
neighbouring NUMA domains would all remain idle.
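
Step 1 above follows from how the ILB CPU is picked: find_new_ilb() in
kernels of this vintage simply takes the first idle CPU in the nohz mask, so
with the low-numbered CPUs sitting in the first NUMA domain, the ILB CPU
always comes from that domain. Roughly (abridged from kernel/sched/fair.c;
details may differ slightly across versions):

	static inline int find_new_ilb(void)
	{
		/* Lowest-numbered idle CPU wins. */
		int ilb = cpumask_first(nohz.idle_cpus_mask);

		if (ilb < nr_cpu_ids && idle_cpu(ilb))
			return ilb;

		return nr_cpu_ids;
	}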

Fix this by checking whether a CPU was woken up to do nohz idle load balancing
before it does load balancing on itself. This way, idle CPUs across the system
get to do load balancing, which results in a quicker spread of load, instead
of the load balancing being confined to the local sched domain hierarchy of
the ILB CPU alone under circumstances such as the above.

Signed-off-by: Preeti U Murthy <preeti@...ux.vnet.ibm.com>
---

 kernel/sched/fair.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcfe320..95b00d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7660,14 +7660,13 @@ static void run_rebalance_domains(struct softirq_action *h)
 	enum cpu_idle_type idle = this_rq->idle_balance ?
 						CPU_IDLE : CPU_NOT_IDLE;
 
-	rebalance_domains(this_rq, idle);
-
 	/*
 	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
 	 * stopped.
 	 */
 	nohz_idle_balance(this_rq, idle);
+	rebalance_domains(this_rq, idle);
 }
 
 /*
