Message-ID: <YzXEZnH6CAtZ+V4R@hirez.programming.kicks-ass.net>
Date: Thu, 29 Sep 2022 18:14:30 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com,
linux-kernel@...r.kernel.org, zhangqiao22@...wei.com
Subject: Re: [PATCH v2] sched/fair: limit sched slice duration
On Fri, Sep 16, 2022 at 03:15:38PM +0200, Vincent Guittot wrote:
> In the presence of a lot of small-weight tasks like sched_idle tasks,
> normal or high-weight tasks can see their ideal runtime (sched_slice)
> increase to hundreds of ms, whereas it normally stays below
> sysctl_sched_latency.
>
> 2 normal tasks running on a CPU will have a max sched_slice of 12ms
> (half of the sched_period). This means that they will make progress
> every sysctl_sched_latency period.
>
> If we now add 1000 idle tasks on the CPU, the sched_period becomes
> 3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
> It even becomes 1500 ms if the idle tasks belong to an idle cgroup.
> This means that the scheduler will only look for another waiting task
> to pick after 609 ms (respectively 1500 ms) of running time. The idle
> tasks significantly change the way the 2 normal tasks interleave their
> running time slots, whereas they should have only a small impact.
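
For reference, here is where those numbers come from; a quick userspace
sketch of the __sched_period()/sched_slice() arithmetic, assuming the
scaled defaults of an 8-CPU box (sysctl_sched_latency = 24ms,
sysctl_sched_min_granularity = 3ms, hence sched_nr_latency = 8) and the
usual weights (nice-0 = 1024, SCHED_IDLE = 3). Simplified model, not
the kernel code:

#include <stdio.h>

#define LATENCY_MS	24ULL	/* assumed sysctl_sched_latency */
#define MIN_GRAN_MS	3ULL	/* assumed sysctl_sched_min_granularity */
#define NR_LATENCY	(LATENCY_MS / MIN_GRAN_MS)
#define NICE0_WEIGHT	1024ULL
#define IDLE_WEIGHT	3ULL	/* WEIGHT_IDLEPRIO */

/* Stretch the period once there are more than NR_LATENCY tasks. */
static unsigned long long period_ms(unsigned long long nr_running)
{
	if (nr_running > NR_LATENCY)
		return nr_running * MIN_GRAN_MS;
	return LATENCY_MS;
}

/* A task's weighted share of the period (flat rq, no cgroups). */
static unsigned long long slice_ms(unsigned long long nr_running,
				   unsigned long long weight,
				   unsigned long long total_weight)
{
	return period_ms(nr_running) * weight / total_weight;
}

int main(void)
{
	/* 2 nice-0 tasks: 24ms period, 12ms slice each. */
	printf("%llu ms\n", slice_ms(2, NICE0_WEIGHT, 2 * NICE0_WEIGHT));

	/* plus 1000 SCHED_IDLE tasks: 3006ms period, ~609ms slice. */
	printf("%llu ms\n", slice_ms(1002, NICE0_WEIGHT,
				     2 * NICE0_WEIGHT + 1000 * IDLE_WEIGHT));
	return 0;
}

That prints 12 ms and 609 ms, matching the figures above.
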
>
> Such a long sched_slice can significantly delay the release of
> resources, as a task can wait hundreds of ms for its next running slot
> just because idle tasks are queued on the rq.
>
> Cap ideal_runtime at the weighted version of sysctl_sched_latency when
> comparing it with the vruntime of the next waiting task, to make sure
> that tasks regularly make progress and are not significantly impacted
> by idle/background tasks queued on the rq.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> ---
>
> I have kept the "if (delta < 0)" test as calc_delta_fair() can't
> handle negative values.
>
> Change since v1:
> - the first 3 patches have been already queued
> - use the weight of curr to scale sysctl_sched_latency before capping
> the ideal_runtime so we can compare vruntime values.
>
> kernel/sched/fair.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5ffec4370602..ba451bb25929 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4610,6 +4610,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	if (delta < 0)
>  		return;
>
> +	ideal_runtime = min_t(u64, ideal_runtime,
> +			      calc_delta_fair(sysctl_sched_latency, curr));
>  	if (delta > ideal_runtime)
>  		resched_curr(rq_of(cfs_rq));
>  }
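
For the record, the cap added there is in curr's vruntime units:
calc_delta_fair() scales a walltime delta by NICE_0_LOAD/weight so it
can be compared against curr->vruntime - se->vruntime. Rough userspace
model below (not the kernel's fixed-point __calc_delta(); the 24ms
latency and the nice -5 weight of 3121 are the usual defaults/table
values, assumed here):

#include <stdio.h>

#define NICE_0_LOAD	1024ULL

/* Simplified calc_delta_fair(): walltime -> vruntime for a given weight. */
static unsigned long long calc_delta_fair_model(unsigned long long delta,
						unsigned long long weight)
{
	if (weight != NICE_0_LOAD)
		delta = delta * NICE_0_LOAD / weight;
	return delta;
}

int main(void)
{
	unsigned long long latency_ns = 24000000ULL;	/* assumed 24ms */

	/* nice 0 (weight 1024): cap is 24ms worth of vruntime. */
	printf("%llu ns\n", calc_delta_fair_model(latency_ns, 1024));

	/* nice -5 (weight 3121): vruntime advances slower, so the same
	 * walltime maps to a smaller vruntime delta (~7.9ms worth). */
	printf("%llu ns\n", calc_delta_fair_model(latency_ns, 3121));
	return 0;
}
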
Since I'm suffering from a cold and constant interruptions, I had to
write down my thinking and ended up with the below.

Does that make sense, or did I go sideways somewhere (entirely possible)?
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5ffec4370602..2b218167fadf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4575,17 +4575,33 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 }

 /*
- * Preempt the current task with a newly woken task if needed:
+ * Tick driven preemption; preempt the task if it has run long enough.
+ * Allows other tasks to have a go.
  */
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	unsigned long ideal_runtime, delta_exec;
 	struct sched_entity *se;
-	s64 delta;
+	s64 delta, delta_exec;
+	u64 ideal_runtime;

-	ideal_runtime = sched_slice(cfs_rq, curr);
+	/* How long this task has been on the CPU [walltime]. */
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+
+	/*
+	 * Ensure that a task that missed wakeup preemption by a
+	 * narrow margin doesn't have to wait for a full slice.
+	 * This also mitigates buddy induced latencies under load.
+	 */
+	if (delta_exec < sysctl_sched_min_granularity)
+		return;
+
+	/*
+	 * When many tasks blow up the sched_period, it is possible that
+	 * sched_slice() reports unusually large results (when many tasks are
+	 * very light for example). Therefore impose a maximum.
+	 */
+	ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
 	if (delta_exec > ideal_runtime) {
 		resched_curr(rq_of(cfs_rq));
 		/*
@@ -4597,19 +4613,24 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	}

 	/*
-	 * Ensure that a task that missed wakeup preemption by a
-	 * narrow margin doesn't have to wait for a full slice.
-	 * This also mitigates buddy induced latencies under load.
+	 * Strictly speaking the above is sufficient; however due to
+	 * imperfections it is possible to have a leftmost task left of
+	 * min_vruntime.
+	 *
+	 * Also impose limits on the delta in vtime.
 	 */
-	if (delta_exec < sysctl_sched_min_granularity)
-		return;

 	se = __pick_first_entity(cfs_rq);
 	delta = curr->vruntime - se->vruntime;
-
 	if (delta < 0)
 		return;

+	/*
+	 * Compare @delta [vtime] to @ideal_runtime [walltime]. This means that
+	 * heavy tasks (for which vtime goes slower) get relatively more time
+	 * before preemption, while light tasks (for which vtime goes faster)
+	 * get relatively less time. IOW, heavy tasks get to run longer.
+	 */
 	if (delta > ideal_runtime)
 		resched_curr(rq_of(cfs_rq));
 }
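
To put rough numbers on that last comparison: curr's vruntime advances
by delta_exec * NICE_0_LOAD / weight, so with @ideal_runtime capped at
sysctl_sched_latency the walltime a task gets before this check fires
scales with its weight. Quick sketch, assuming 24ms latency and the
usual weight table values (3121 for nice -5, 335 for nice +5):

#include <stdio.h>

#define NICE_0_LOAD	1024ULL

/* Walltime (ms) until curr's vruntime delta reaches the walltime-valued cap. */
static unsigned long long walltime_to_hit_cap_ms(unsigned long long cap_ms,
						 unsigned long long weight)
{
	return cap_ms * weight / NICE_0_LOAD;
}

int main(void)
{
	unsigned long long cap_ms = 24;	/* assumed sysctl_sched_latency */

	printf("nice  0: ~%llu ms\n", walltime_to_hit_cap_ms(cap_ms, 1024)); /* 24 */
	printf("nice -5: ~%llu ms\n", walltime_to_hit_cap_ms(cap_ms, 3121)); /* 73 */
	printf("nice +5: ~%llu ms\n", walltime_to_hit_cap_ms(cap_ms, 335));  /*  7 */
	return 0;
}

IOW a nice -5 curr can run for roughly 73ms of walltime before being
preempted here, a nice +5 one for only about 7ms (assuming the leftmost
task's vruntime stands still while it waits).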