Date:   Thu, 29 Sep 2022 18:14:30 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     mingo@...hat.com, juri.lelli@...hat.com, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com,
        linux-kernel@...r.kernel.org, zhangqiao22@...wei.com
Subject: Re: [PATCH v2] sched/fair: limit sched slice duration

On Fri, Sep 16, 2022 at 03:15:38PM +0200, Vincent Guittot wrote:
> In the presence of a lot of small-weight tasks like sched_idle tasks,
> normal or high-weight tasks can see their ideal runtime (sched_slice)
> increase to hundreds of ms, whereas it normally stays below
> sysctl_sched_latency.
> 
> Two normal tasks running on a CPU will have a max sched_slice of 12ms
> (half of the sched_period). This means that they will make progress
> every sysctl_sched_latency period.
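
For reference, assuming the defaults as scaled on an 8-CPU machine
(sysctl_sched_latency = 24ms) and nice-0 weight 1024, that 12ms falls
out of the usual slice arithmetic; with only 2 runnable tasks
sched_period == sysctl_sched_latency, so roughly:

    sched_slice = sched_period * se_weight / cfs_rq_weight
                = 24ms * 1024 / (1024 + 1024) = 12ms
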
> 
> If we now add 1000 idle tasks on the CPU, the sched_period becomes
> 3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
> It will even become 1500 ms if the idle tasks belong to an idle cgroup.
> This means that the scheduler will not look at picking another waiting
> task until the current one has run for 609 ms (1500 ms respectively).
> The idle tasks significantly change the way the 2 normal tasks
> interleave their running time slots, whereas they should have only a
> small impact.
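
Spelling out where those numbers come from (again assuming the scaled
defaults: sysctl_sched_min_granularity = 3ms, nice-0 weight 1024,
SCHED_IDLE weight 3; with 1002 runnable tasks sched_period becomes
nr_running * min_granularity), roughly:

    sched_period  = 1002 * 3ms                           = 3006ms
    ideal_runtime = 3006ms * 1024 / (2*1024 + 1000*3)   ~=  609ms
    idle cgroup:    3006ms * 1024 / (2*1024 + 3)        ~= 1500ms
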
> 
> Such a long sched_slice can significantly delay the release of
> resources, as tasks can wait hundreds of ms before their next running
> slot just because of idle tasks queued on the rq.
> 
> Cap the ideal_runtime to the weighted version of sysctl_sched_latency
> when comparing it with the vruntime of the next waiting task, to make
> sure that tasks regularly make progress and are not significantly
> impacted by idle/background tasks queued on the rq.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> ---
> 
> I have kept the test if (delta < 0), as calc_delta_fair() can't
> handle negative values.
> 
> Changes since v1:
>   - the first 3 patches have already been queued
>   - use the weight of curr to scale sysctl_sched_latency before capping
>     the ideal_runtime, so we can compare vruntime values.
> 
>  kernel/sched/fair.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5ffec4370602..ba451bb25929 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4610,6 +4610,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	if (delta < 0)
>  		return;
>  
> +	ideal_runtime = min_t(u64, ideal_runtime,
> +				   calc_delta_fair(sysctl_sched_latency, curr));
>  	if (delta > ideal_runtime)
>  		resched_curr(rq_of(cfs_rq));
>  }
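
As I read the hunk above, calc_delta_fair(x, se) is roughly
x * NICE_0_LOAD / se->load.weight, so the new min() caps ideal_runtime
at sysctl_sched_latency rescaled into curr's vruntime units, i.e. the
same unit as the delta = curr->vruntime - se->vruntime it is compared
against just below.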

Since I'm suffering from a cold and constant interruptions, I had to
write down my thinking and ended up with the below.

Does that make sense, or did I go sideways somewhere (entirely
possible)?

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5ffec4370602..2b218167fadf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4575,17 +4575,33 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 }
 
 /*
- * Preempt the current task with a newly woken task if needed:
+ * Tick-driven preemption; preempt the task if it has run long enough.
+ * Allows other tasks to have a go.
  */
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	unsigned long ideal_runtime, delta_exec;
 	struct sched_entity *se;
-	s64 delta;
+	s64 delta, delta_exec;
+	u64 ideal_runtime;
 
-	ideal_runtime = sched_slice(cfs_rq, curr);
+	/* How long this task has been on the CPU [walltime]. */
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+
+	/*
+	 * Ensure that a task that missed wakeup preemption by a
+	 * narrow margin doesn't have to wait for a full slice.
+	 * This also mitigates buddy induced latencies under load.
+	 */
+	if (delta_exec < sysctl_sched_min_granularity)
+		return;
+
+	/*
+	 * When many tasks blow up the sched_period, it is possible that
+	 * sched_slice() reports unusually large results (for example when
+	 * many tasks are very light). Therefore impose a maximum.
+	 */
+	ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
 	if (delta_exec > ideal_runtime) {
 		resched_curr(rq_of(cfs_rq));
 		/*
@@ -4597,19 +4613,24 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	}
 
 	/*
-	 * Ensure that a task that missed wakeup preemption by a
-	 * narrow margin doesn't have to wait for a full slice.
-	 * This also mitigates buddy induced latencies under load.
+	 * Strictly speaking the above is sufficient; however, due to
+	 * imperfections it is possible to have a leftmost task left of
+	 * min_vruntime.
+	 *
+	 * Also impose limits on the delta in vtime.
 	 */
-	if (delta_exec < sysctl_sched_min_granularity)
-		return;
 
 	se = __pick_first_entity(cfs_rq);
 	delta = curr->vruntime - se->vruntime;
-
 	if (delta < 0)
 		return;
 
+	/*
+	 * Compare @delta [vtime] to @ideal_runtime [walltime]. This means that
+	 * heavy tasks (for which vtime goes slower) get relatively more time
+	 * before preemption, while light tasks (for which vtime goes faster)
+	 * get relatively less time.  IOW, heavy tasks get to run longer.
+	 */
 	if (delta > ideal_runtime)
 		resched_curr(rq_of(cfs_rq));
 }
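
To make that last comment concrete: while curr runs, delta grows at
roughly walltime * NICE_0_LOAD / curr->load.weight (the waiting se's
vruntime doesn't move), so a task with ~3x the nice-0 weight (e.g.
nice -5) needs ~3x ideal_runtime of walltime before this check fires,
while a task with ~1/3 the weight hits it ~3x sooner. Indicative
numbers only, but that's the sense in which heavy tasks get to run
longer here.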
