linux-kernel - [PATCH v3] sched: modify how to compute a slice and check a preemptability

Message-Id: <1437297060-25378-1-git-send-email-byungchul.park@lge.com>
Date:	Sun, 19 Jul 2015 18:11:00 +0900
From:	byungchul.park@....com
To:	mingo@...nel.org, peterz@...radead.org
Cc:	linux-kernel@...r.kernel.org,
	Byungchul Park <byungchul.park@....com>
Subject: [PATCH v3] sched: modify how to compute a slice and check a preemptability

From: Byungchul Park <byungchul.park@....com>

hello all,

i asked the question below in the previous version (v2) of this patch.

***

sysctl_sched_min_granularity must be defined clearly first. once it is
defined, the way it should work can be settled. the definition can be
either case 1 or case 2 below.

case 1. every task must get a slice of at least sysctl_sched_min_granularity,
which is currently 0.75ms. in this case, increasing the number of tasks in a
rq stretches the whole latency, which most of you don't like because the
latency can grow too much. but it looks normal to me, since it already happens
in the !CONFIG_FAIR_GROUP_SCHED world with a large number of tasks.
i wonder why the CONFIG_FAIR_GROUP_SCHED world must be different from the
!CONFIG_FAIR_GROUP_SCHED world. anyway...

case 2. tasks can have a slice much smaller than sysctl_sched_min_granularity,
depending on their position in the hierarchy. if a rq has 8 equally weighted
sched entities, each of those has 8 equally weighted sched entities, and the
same is repeated one more level down, then a task can have a very small slice,
e.g. 0.75ms / 64 ~ 0.01ms. adding more cgroup levels makes it worse. in this
situation, context switching overhead becomes very large. what does
sysctl_sched_min_granularity mean here? anyway...
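
to make the arithmetic in case 2 concrete, here is a minimal userspace
sketch, not kernel code. it assumes the unscaled defaults (6ms latency,
0.75ms minimum granularity, 8 tasks per latency period) and equal weights at
every level. nested_slice_ns() and the macro names are hypothetical helpers
made up for this illustration.

```c
#include <stdint.h>

/* assumed, unscaled defaults, in nanoseconds */
#define SCHED_LATENCY_NS   6000000ULL /* stands in for sysctl_sched_latency */
#define SCHED_MIN_GRAN_NS   750000ULL /* stands in for sysctl_sched_min_granularity */
#define SCHED_NR_LATENCY          8UL /* tasks that fit in one latency period */

/* simplified model of __sched_period(): stretch the period once too many run */
static uint64_t sched_period_ns(unsigned long nr_running)
{
	if (nr_running > SCHED_NR_LATENCY)
		return SCHED_MIN_GRAN_NS * nr_running;
	return SCHED_LATENCY_NS;
}

/*
 * hypothetical helper: slice of a task that sits `levels` deep when every
 * level holds `fanout` equally weighted entities. each level hands the
 * entity 1/fanout of what its parent got.
 */
static uint64_t nested_slice_ns(unsigned long fanout, unsigned int levels)
{
	uint64_t slice = sched_period_ns(fanout);
	unsigned int i;

	for (i = 0; i < levels; i++)
		slice /= fanout;

	return slice;
}
```

under these assumptions nested_slice_ns(8, 1) is 750000 ns (0.75ms) and
nested_slice_ns(8, 3) is 11718 ns, which is the 0.75ms / 64 ~ 0.01ms above.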

i am not sure which of case 1 and case 2 is the right definition of
sysctl_sched_min_granularity. what do you think about this?

***

i wrote this v3 patch assuming case 1 is right. if case 2 is right, then
the modifications in check_preempt_tick() should be ignored.

does that make sense?

thank you,
byungchul

---------------->8----------------
From 7ebce566af9b952d24494cd1258b481ec6639cc1 Mon Sep 17 00:00:00 2001
From: Byungchul Park <byungchul.park@....com>
Date: Sun, 19 Jul 2015 17:11:37 +0900
Subject: [PATCH v3] sched: modify how to compute a slice and check a
 preemptability

make the cfs scheduler use the rq level nr_running to compute a period when
CONFIG_FAIR_GROUP_SCHED is enabled. using the local cfs_rq's nr_running to
get the period gives inconsistent results. for example, imagine the cgroup
structure below.

root(=rq.cfs)--group1----a
                     |---b
                     |---c
                     |---d
                     |---e
                     |---f
                     |---g
                     |---h
                     |---i
                     |---j
                     |---k
                     |---l
                     |---m

in this case, with the current code, group1's slice is not comparable to
(a's slice + ... + m's slice). this also makes code that uses
sum_exec_runtime behave oddly. it happens because the current code does not
use a consistent rq-wide value to compute an rq-wide period.
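
the mismatch can be reproduced with a small userspace model, given here only
as a sketch. it assumes the unscaled defaults (6ms latency, 0.75ms minimum
granularity, 8 tasks per latency period), equal weights, and that
rq->cfs.nr_running counts only the entities queued directly on the root
cfs_rq (here just group1). all function and macro names are hypothetical.

```c
#include <stdint.h>

/* assumed, unscaled defaults, in nanoseconds */
#define SCHED_LATENCY_NS   6000000ULL
#define SCHED_MIN_GRAN_NS   750000ULL
#define SCHED_NR_LATENCY          8UL

/* simplified model of __sched_period() */
static uint64_t period_ns(unsigned long nr_running)
{
	if (nr_running > SCHED_NR_LATENCY)
		return SCHED_MIN_GRAN_NS * nr_running;
	return SCHED_LATENCY_NS;
}

/* slice of one of root_nr equally weighted entities on the root cfs_rq */
static uint64_t group_slice_ns(unsigned long root_nr)
{
	return period_ns(root_nr) / root_nr;
}

/*
 * slice of one of nr_tasks equally weighted tasks inside a group that is
 * the only entity on the root cfs_rq, so the group keeps the whole root
 * share. current code: the period comes from the local cfs_rq.
 */
static uint64_t task_slice_current_ns(unsigned long nr_tasks)
{
	return period_ns(nr_tasks) / nr_tasks;
}

/* same task, but with the period taken from the root cfs_rq's nr_running */
static uint64_t task_slice_patched_ns(unsigned long nr_tasks,
				      unsigned long root_nr)
{
	return period_ns(root_nr) / nr_tasks;
}
```

for the 13 tasks a..m, the current computation gives 13 * 750000 ns = 9.75ms
in total against a 6ms slice for group1. with the root-level count each task
gets 461538 ns, so the 13 slices add up to about 6ms, matching group1.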

in addition, modify the preemption check to ensure that a sched entity runs
for at least sysctl_sched_min_granularity before it can be preempted.
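
the intended order of checks can be sketched in userspace as below. this is
an illustration under assumptions, not the kernel function: struct
entity_times, may_preempt_tick() and the macro are hypothetical, the 0.75ms
value is the unscaled default, and the later vruntime comparison in
check_preempt_tick() is left out. note that the sketch computes delta_exec
before testing it, since the test needs the elapsed runtime; the hunk as
posted appears to place the test above the line that assigns delta_exec.

```c
#include <stdbool.h>
#include <stdint.h>

/* assumed, unscaled default for sysctl_sched_min_granularity, in ns */
#define SCHED_MIN_GRAN_NS 750000ULL

/* hypothetical stand-in for the two sched_entity fields used here */
struct entity_times {
	uint64_t sum_exec_runtime;      /* total runtime so far */
	uint64_t prev_sum_exec_runtime; /* total runtime when it was picked */
};

/* returns true when curr has used up its slice and may be preempted */
static bool may_preempt_tick(const struct entity_times *curr,
			     uint64_t ideal_runtime_ns)
{
	/* runtime since curr was last picked; must be known before any test */
	uint64_t delta_exec = curr->sum_exec_runtime -
			      curr->prev_sum_exec_runtime;

	/* never preempt before the minimum granularity has elapsed */
	if (delta_exec < SCHED_MIN_GRAN_NS)
		return false;

	/* past the minimum: preempt once the slice is exceeded */
	return delta_exec > ideal_runtime_ns;
}
```

with this order, an entity whose computed slice is shorter than 0.75ms still
runs for at least 0.75ms before the slice test can preempt it.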

Signed-off-by: Byungchul Park <byungchul.park@....com>
---
 kernel/sched/fair.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09456fc..41c619f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -635,7 +635,7 @@ static u64 __sched_period(unsigned long nr_running)
  */
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
+	u64 slice = __sched_period(rq_of(cfs_rq)->cfs.nr_running + !se->on_rq);

 	for_each_sched_entity(se) {
 		struct load_weight *load;
@@ -3226,6 +3226,12 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	struct sched_entity *se;
 	s64 delta;

+	/*
+	 * Ensure that a task executes at least for sysctl_sched_min_granularity
+	 */
+	if (delta_exec < sysctl_sched_min_granularity)
+		return;
+
 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
 	if (delta_exec > ideal_runtime) {
@@ -3243,9 +3249,6 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	 * narrow margin doesn't have to wait for a full slice.
 	 * This also mitigates buddy induced latencies under load.
 	 */
-	if (delta_exec < sysctl_sched_min_granularity)
-		return;
-
 	se = __pick_first_entity(cfs_rq);
 	delta = curr->vruntime - se->vruntime;

-- 
1.7.9.5
