linux-kernel - [PATCH] sched: new feature to spread tasks inside cpu-groups

Message-ID: <53B1151E.6030603@linux.vnet.ibm.com>
Date:	Mon, 30 Jun 2014 15:43:26 +0800
From:	Michael wang <wangyun@...ux.vnet.ibm.com>
To:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>
CC:	Mike Galbraith <umgwanakikbuti@...il.com>,
	Rik van Riel <riel@...hat.com>,
	Alex Shi <alex.shi@...aro.org>, Paul Turner <pjt@...gle.com>,
	Mel Gorman <mgorman@...e.de>,
	Daniel Lezcano <daniel.lezcano@...aro.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: [PATCH] sched: new feature to spread tasks inside cpu-groups

Recent testing shows that the cpu-cgroup fails to manage a mixed workload
of dbench and stress, set up as follows:

	mkdir /cgroup/cpu/l1/
	mkdir /cgroup/cpu/l1/A
	mkdir /cgroup/cpu/l1/B
	mkdir /cgroup/cpu/l1/C

	echo $$ > /cgroup/cpu/l1/A/tasks ; dbench 6
	echo $$ > /cgroup/cpu/l1/B/tasks ; stress 6
	echo $$ > /cgroup/cpu/l1/C/tasks ; stress 6

Although the cpu-shares were 1:1:1 (A:B:C), the CPU% was around 1:5:5.

Now by doing:

	echo 102400 > /cgroup/cpu/l1/A/cpu.shares

the cpu-shares become 100:1:1; however, the CPU% is still around 1:5:5.

The test can be extended to 10000:1:1 on cpu-shares or even more, and the
CPU% still stays around 1:5:5.
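
For reference, the split that the cpu-shares settings above ask for can be
worked out with a few lines of arithmetic. This is only a sketch: it assumes
every group is runnable all the time (which dbench is not) and that CPU time
is divided purely by weight. The helper name ideal_share_pct() is made up for
illustration and is not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not kernel code: the CPU percentage group 'idx'
 * would get if time were divided strictly in proportion to cpu.shares.
 */
static double ideal_share_pct(const unsigned long *shares, size_t n, size_t idx)
{
	unsigned long total = 0;
	size_t i;

	assert(idx < n);

	/* Sum the weights of all sibling groups. */
	for (i = 0; i < n; i++)
		total += shares[i];
	assert(total > 0);

	return 100.0 * (double)shares[idx] / (double)total;
}
```

With shares of 102400:1024:1024 this gives group A roughly 98% of the CPU
time, while the observed 1:5:5 leaves it with about 1/11, or roughly 9%.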

We used to think this was because dbench only needs that much CPU%, but
that is not true: after binding each instance to a different CPU, the CPU%
became 3:4:4 with only 10:1:1 on cpu-shares.

However, binding tasks to individual CPUs is definitely not a good solution;
we need a feature that can spread tasks inside a group while still following
the current scheduler logic.

This patch introduces a new feature that meets these requirements: it
locates an idle cfs_rq inside the cpu-cgroup when, and only when, we are
about to give up searching for an idle CPU. This makes tasks spread more
actively inside the cpu-cgroup than usual.
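
The fallback search can be modelled in userspace as below. This is a
simplified sketch, not the code in the patch: it flattens the sched_domain
hierarchy into fixed-size groups of consecutive CPUs and ignores CPU
affinity. MODEL_NR_CPUS, MODEL_GROUP_SIZE and the model_*() names are
assumptions made up for the example.

```c
#include <assert.h>

#define MODEL_NR_CPUS    8	/* assumption: 8 CPUs share the LLC */
#define MODEL_GROUP_SIZE 2	/* assumption: sched groups of 2 CPUs */

/*
 * A CPU is idle from the task group's view when the group's cfs_rq on
 * that CPU has nothing runnable, even if other groups keep the CPU busy.
 */
static int model_tg_idle_cpu(const unsigned int *tg_nr_running, int cpu)
{
	return !tg_nr_running[cpu];
}

/*
 * Return 'target' if it is idle for the group; otherwise return the first
 * CPU of the first sched group whose CPUs are all idle for the group and
 * which does not contain 'target'. Fall back to 'target' if none is found.
 */
static int model_tg_idle_sibling(const unsigned int *tg_nr_running, int target)
{
	int first, cpu;

	assert(target >= 0 && target < MODEL_NR_CPUS);

	if (model_tg_idle_cpu(tg_nr_running, target))
		return target;

	for (first = 0; first < MODEL_NR_CPUS; first += MODEL_GROUP_SIZE) {
		int all_idle = 1;

		for (cpu = first; cpu < first + MODEL_GROUP_SIZE; cpu++) {
			if (cpu == target ||
			    !model_tg_idle_cpu(tg_nr_running, cpu)) {
				all_idle = 0;
				break;
			}
		}

		if (all_idle)
			return first;
	}

	return target;
}
```

For example, if the group has runnable tasks on CPUs 0, 2 and 3 only, a
wakeup targeting CPU 0 is redirected to CPU 4 in this model, because the
pair {4, 5} is the first sched group that is fully idle for the group.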

Now by doing:

	echo SPREAD_INSIDE_GROUP > /sys/kernel/debug/sched_features

the 10:1:1 on cpu-shares leads to 3:4:4 on CPU%, and the throughput of
dbench also rises, so we finally have a way to help dbench (transaction
workload) compete with stress (CPU-intensive workload).

CC: Ingo Molnar <mingo@...nel.org>
CC: Peter Zijlstra <peterz@...radead.org>
Signed-off-by: Michael Wang <wangyun@...ux.vnet.ibm.com>
---
 kernel/sched/fair.c     |   63 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |    8 ++++++
 2 files changed, 71 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..0e3022c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4409,6 +4409,51 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+static inline int tg_idle_cpu(struct task_group *tg, int cpu)
+{
+	return !tg->cfs_rq[cpu]->nr_running;
+}
+
+/*
+ * Try and locate an idle CPU in the sched_domain from tg's view.
+ */
+static int tg_idle_sibling(struct task_struct *p, int target)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+	int i = task_cpu(p);
+	struct task_group *tg = task_group(p);
+
+	if (tg_idle_cpu(tg, target))
+		goto done;
+
+	sd = rcu_dereference(per_cpu(sd_llc, target));
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
+
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (i == target || !tg_idle_cpu(tg, i))
+					goto next;
+			}
+
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
+	}
+
+done:
+
+	return target;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -4417,6 +4462,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
+	struct sched_entity *se = task_group(p)->se[i];
 
 	if (idle_cpu(target))
 		return target;
@@ -4451,6 +4497,23 @@ next:
 		} while (sg != sd->groups);
 	}
 done:
+
+	if (!idle_cpu(target) && sched_feat(SPREAD_INSIDE_GROUP)) {
+		/*
+		 * Before we arbitrarily return the target, try to locate an
+		 * idle cfs_rq inside the task's group using the same logic.
+		 *
+		 * This tries to prevent tasks from gathering, especially
+		 * those that wake-affine rapidly while rarely being balanced;
+		 * wakeup is the only chance to spread them.
+		 *
+		 * We only need to take care of tasks that flip frequently;
+		 * the load-balance routine will take care of the others.
+		 */
+		if (p->wakee_flips > this_cpu_read(sd_llc_size))
+			return tg_idle_sibling(p, target);
+	}
+
 	return target;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d1..532d6e9 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -6,6 +6,14 @@
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
 
 /*
+ * Adopt the logic of select_idle_sibling() to pick an idle cfs_rq
+ * inside the task's cpu-group; this helps to spread the group's
+ * tasks internally and benefits those who prefer balancing over
+ * gathering.
+ */
+SCHED_FEAT(SPREAD_INSIDE_GROUP, false)
+
+/*
  * Place new tasks ahead so that they do not starve already running
  * tasks
  */
-- 
1.7.9.5
