Message-Id: <1364457537-15114-5-git-send-email-iamjoonsoo.kim@lge.com>
Date: Thu, 28 Mar 2013 16:58:55 +0900
From: Joonsoo Kim <iamjoonsoo.kim@....com>
To: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Mike Galbraith <efault@....de>,
Paul Turner <pjt@...gle.com>, Alex Shi <alex.shi@...el.com>,
Preeti U Murthy <preeti@...ux.vnet.ibm.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Morten Rasmussen <morten.rasmussen@....com>,
Namhyung Kim <namhyung@...nel.org>,
Joonsoo Kim <iamjoonsoo.kim@....com>
Subject: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Walking up the upper se hierarchy in sched_slice() should not be done,
because sched_slice() is used to check whether a resched is needed
within *this* cfs_rq, and the current implementation has a problem
related to this.
The problem is that if we walk up the upper se hierarchy in
sched_slice(), we can end up with an ideal slice that is lower than
sysctl_sched_min_granularity.
For example, assume that 4 tgs with the same share are attached to the
root tg, and each of them has 20 runnable tasks on cpu0.
In this case, __sched_period() returns sysctl_sched_min_granularity * 20
and we enter the loop. In the first iteration, we compute this task's
portion of the slice on its own cfs_rq and get sysctl_sched_min_granularity.
In the second iteration, we get a slice which is a quarter of
sysctl_sched_min_granularity, because there are 4 tgs with the same
share on the root cfs_rq.
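To make the arithmetic concrete, here is a minimal userspace sketch of
the calculation (an illustration, not kernel code); it assumes equal
weights everywhere, so each calc_delta_mine() step reduces to a plain
division, and it uses the 2.25ms min_granularity from the test setup
below.

#include <stdio.h>

int main(void)
{
	/* values from the test setup described below */
	unsigned long long min_gran_ns = 2250000ULL;	/* 2.25ms */
	unsigned long long nr_tasks = 20;		/* runnable tasks per tg */
	unsigned long long nr_tgs = 4;			/* equal-share tgs under root */

	/* __sched_period(): 20 > sched_nr_latency, so period = 20 * min_gran */
	unsigned long long period = nr_tasks * min_gran_ns;

	/* 1st iteration: task se against its tg's cfs_rq, 1/20 of the load */
	unsigned long long slice = period / nr_tasks;
	printf("after 1st iteration: %llu ns\n", slice);	/* 2250000 ns */

	/* 2nd iteration: tg se against the root cfs_rq, 1/4 of the load */
	slice /= nr_tgs;
	printf("after 2nd iteration: %llu ns\n", slice);	/* 562500 ns */

	return 0;
}

562500ns is well below both min_granularity and the 1ms tick, which is
what the test below demonstrates.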
Ensuring a slice larger than min_granularity is important for
performance, and since there is no lower bound on the result here other
than the timer tick, we should fix sched_slice() not to consider upper
se entities.
Below are my test results, verifying this effect on a 4-cpu machine in
the following environment.
CONFIG_HZ=1000 and CONFIG_SCHED_AUTOGROUP=y
/proc/sys/kernel/sched_min_granularity_ns is 2250000, that is, 2.25ms
(the 0.75ms default scaled by a factor of 1 + ilog2(4 cpus) = 3).
In each of 4 sessions, I ran:

for i in `seq 20`; do taskset -c 3 sh -c 'while true; do :; done' & done

and then recorded and inspected the context switches:

./perf sched record
./perf script -C 003 | grep sched_switch | cut -b -40 | less

The results are below.
*Vanilla*
sh 2724 [003] 152.52801
sh 2779 [003] 152.52900
sh 2775 [003] 152.53000
sh 2751 [003] 152.53100
sh 2717 [003] 152.53201
*With this patch*
sh 2640 [003] 147.48700
sh 2662 [003] 147.49000
sh 2601 [003] 147.49300
sh 2633 [003] 147.49400
In the vanilla case, the resulting slice is lower than 1ms, so every
tick triggers a reschedule. With the patch applied, we can see that
min_granularity is ensured.
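For reference, this is how sched_slice() reads with the patch below
applied:

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct load_weight *load;
	struct load_weight lw;
	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

	/* only the load of *this* cfs_rq is considered now */
	load = &cfs_rq->load;

	if (unlikely(!se->on_rq)) {
		lw = cfs_rq->load;

		update_load_add(&lw, se->load.weight);
		load = &lw;
	}
	slice = calc_delta_mine(slice, se->load.weight, load);

	return slice;
}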
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@....com>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 204a9a9..e232421 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -631,23 +631,20 @@ static u64 __sched_period(unsigned long nr_running)
  */
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	struct load_weight *load;
+	struct load_weight lw;
 	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
 
-	for_each_sched_entity(se) {
-		struct load_weight *load;
-		struct load_weight lw;
-
-		cfs_rq = cfs_rq_of(se);
-		load = &cfs_rq->load;
+	load = &cfs_rq->load;
 
-		if (unlikely(!se->on_rq)) {
-			lw = cfs_rq->load;
+	if (unlikely(!se->on_rq)) {
+		lw = cfs_rq->load;
 
-			update_load_add(&lw, se->load.weight);
-			load = &lw;
-		}
-		slice = calc_delta_mine(slice, se->load.weight, load);
+		update_load_add(&lw, se->load.weight);
+		load = &lw;
 	}
+	slice = calc_delta_mine(slice, se->load.weight, load);
+
 	return slice;
 }
 
--
1.7.9.5