Date:	Mon, 04 Jun 2012 11:27:15 +0530
From:	Prashanth Nageshappa <prashanth@...ux.vnet.ibm.com>
To:	Peter Zijlstra <peterz@...radead.org>, mingo@...nel.org,
	LKML <linux-kernel@...r.kernel.org>, roland@...nel.org,
	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>, efault@....de,
	Ingo Molnar <mingo@...e.hu>
Subject: [PATCH] sched: balance_cpu to consider other cpus in its group as
 target of (pinned) task migration

Based on the description in
http://marc.info/?l=linux-kernel&m=133108682113018&w=2 , I was able to recreate
a problem wherein a SCHED_OTHER thread never gets runtime, even though there is
one allowed CPU where it can run and make progress.

On a dual socket box (4 cores per socket, 2 threads per core) with the
following config:
0 8	1 9	4 12	5 13
2 10	3 11	6 14	7 15
|__________|    |__________|
 socket 1        socket 2

If we have the following 4 tasks (2 SCHED_FIFO and 2 SCHED_OTHER), started in
this order:
1> SCHED_FIFO cpu hogging task bound to cpu 1
2> SCHED_OTHER cpu hogging task bound to cpus 3 & 9 - running on cpu 3
   sleeps and wakes up after all other tasks are started
3> SCHED_FIFO cpu hogging task bound to cpu 3
4> SCHED_OTHER cpu hogging task bound to cpu 9

Once all 4 tasks are started, we observe that the 2nd task is starved of CPU
after waking up. It wakes up on its prev_cpu (3), where a FIFO task is now
hogging the cpu. To prevent starvation, the 2nd task needs to be pulled to
cpu 9. However, of cpus 1 and 9, cpu 1 is the one chosen to pull tasks
towards its core. When it tries pulling the 2nd task towards its core, it is
unable to do so because cpu 1 is not in the 2nd task's cpus_allowed mask.
Ideally, cpu 1 should note that the task can be moved to its sibling (cpu 9)
and trigger that movement.

This patch identifies the case where the load balancing goal was not fully
achieved because the destination cpu is not in the cpus_allowed mask of the
target task(s), and retries load balancing to try to move tasks to other cpus
in the same sched_group as the destination cpu.

Tested on tip commit cca44889.

Signed-off-by: Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>
Signed-off-by: Prashanth Nageshappa <prashanth@...ux.vnet.ibm.com>

---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de49ed5..da275d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3098,6 +3098,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
+#define LBF_NEW_DST_CPU	0x04
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3108,6 +3109,8 @@ struct lb_env {
 	int			dst_cpu;
 	struct rq		*dst_rq;
 
+	struct cpumask		*dst_grpmask;
+	int			new_dst_cpu;
 	enum cpu_idle_type	idle;
 	long			imbalance;
 	unsigned int		flags;
@@ -3198,7 +3201,25 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) are cache-hot on their current CPU.
 	 */
 	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
-		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+		int new_dst_cpu;
+
+		if (!env->dst_grpmask) {
+			schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+			return 0;
+		}
+		/*
+		 * check if cpus_allowed has any cpus in the same sched_group
+		 * as that of dst_cpu and set LBF_NEW_DST_CPU and new_dst_cpu
+		 * accordingly
+		 */
+		new_dst_cpu = cpumask_first_and(env->dst_grpmask,
+						tsk_cpus_allowed(p));
+		if (new_dst_cpu >= nr_cpu_ids) {
+			schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+		} else {
+			env->flags |= LBF_NEW_DST_CPU;
+			env->new_dst_cpu = new_dst_cpu;
+		}
 		return 0;
 	}
 	env->flags &= ~LBF_ALL_PINNED;
@@ -4418,7 +4439,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			struct sched_domain *sd, enum cpu_idle_type idle,
 			int *balance)
 {
-	int ld_moved, active_balance = 0;
+	int ld_moved, old_ld_moved, active_balance = 0;
 	struct sched_group *group;
 	struct rq *busiest;
 	unsigned long flags;
@@ -4428,6 +4449,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.sd		    = sd,
 		.dst_cpu	    = this_cpu,
 		.dst_rq		    = this_rq,
+		.dst_grpmask	    = sched_group_cpus(sd->groups),
 		.idle		    = idle,
 		.loop_break	    = sched_nr_migrate_break,
 		.find_busiest_queue = find_busiest_queue,
@@ -4461,6 +4483,7 @@ redo:
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
 	ld_moved = 0;
+	old_ld_moved = 0;
 	if (busiest->nr_running > 1) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
@@ -4488,12 +4511,27 @@ more_balance:
 			env.flags &= ~LBF_NEED_BREAK;
 			goto more_balance;
 		}
-
 		/*
 		 * some other cpu did the load balance for us.
 		 */
-		if (ld_moved && this_cpu != smp_processor_id())
-			resched_cpu(this_cpu);
+		if ((ld_moved != old_ld_moved) &&
+			(env.dst_cpu != smp_processor_id()))
+			resched_cpu(env.dst_cpu);
+
+		if ((env.flags & LBF_NEW_DST_CPU) && (env.imbalance > 0)) {
+			/*
+			 * we could not balance completely as some tasks
+			 * were not allowed to move to the dst_cpu, so try
+			 * again with new_dst_cpu.
+			 */
+			this_rq = cpu_rq(env.new_dst_cpu);
+			env.dst_rq = this_rq;
+			env.dst_cpu = env.new_dst_cpu;
+			env.flags &= ~LBF_NEW_DST_CPU;
+			env.loop = 0;
+			old_ld_moved = ld_moved;
+			goto more_balance;
+		}
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(env.flags & LBF_ALL_PINNED)) {
@@ -4694,6 +4732,7 @@ static int active_load_balance_cpu_stop(void *data)
 			.sd		= sd,
 			.dst_cpu	= target_cpu,
 			.dst_rq		= target_rq,
+			.dst_grpmask	= NULL,
 			.src_cpu	= busiest_rq->cpu,
 			.src_rq		= busiest_rq,
 			.idle		= CPU_IDLE,
