Date: Thu, 6 Oct 2022 13:34:57 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, rostedt@...dmis.org,
 bsegall@...gle.com, mgorman@...e.de, bristot@...hat.com,
 vschneid@...hat.com, linux-kernel@...r.kernel.org
Cc: zhangqiao22@...wei.com
Subject: Re: [PATCH v3] sched/fair: limit sched slice duration

On 03/10/2022 14:21, Vincent Guittot wrote:
> In the presence of a lot of small-weight tasks like sched_idle tasks,
> normal or high-weight tasks can see their ideal runtime (sched_slice)
> increase to hundreds of ms, whereas it normally stays below
> sysctl_sched_latency.
>
> 2 normal tasks running on a CPU will have a max sched_slice of 12ms
> (half of the sched_period). This means that they will make progress
> every sysctl_sched_latency period.
>
> If we now add 1000 idle tasks on the CPU, the sched_period becomes
> 3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
> It will even become 1500ms if the idle tasks belong to an idle cgroup.
> This means that the scheduler will only look for picking another
> waiting task after 609ms (respectively 1500ms) of running time. The
> idle tasks significantly change the way the 2 normal tasks interleave
> their running time slots, whereas they should have only a small impact.
>
> Such a long sched_slice can significantly delay the release of
> resources, as the tasks can wait hundreds of ms before their next
> running slot just because of idle tasks queued on the rq.
>
> Cap the ideal_runtime to sysctl_sched_latency to make sure that tasks
> regularly make progress and are not significantly impacted by
> idle/background tasks queued on the rq.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> ---
>
> Change since v2:
> - Cap ideal_runtime from the beginning as suggested by Peter
>
>  kernel/sched/fair.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5ffec4370602..c309d57efb2c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4584,7 +4584,13 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	struct sched_entity *se;
>  	s64 delta;
>
> -	ideal_runtime = sched_slice(cfs_rq, curr);
> +	/*
> +	 * When many tasks blow up the sched_period, it is possible that
> +	 * sched_slice() reports unusually large results (when many tasks are
> +	 * very light for example). Therefore impose a maximum.
> +	 */
> +	ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
> +
>  	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
>  	if (delta_exec > ideal_runtime) {
>  		resched_curr(rq_of(cfs_rq));

Tested on a 6 CPU system (sysctl_sched_latency=18ms,
sysctl_sched_min_granularity=2.25ms).

I start to see `slice > period` when I run:

(a) > ~50 idle tasks in '/' for an arbitrary nice=0 task

(b) > ~50 nice=0 tasks in '/A' w/ cpu.shares = max for the se of '/A'

Essentially in moments in which cfs_rq->nr_running > sched_nr_latency
and the se_weight is relatively high compared to the cfs_rq_weight.

Tested-By: Dietmar Eggemann <dietmar.eggemann@....com>
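
As a cross-check of the numbers in the changelog, here is a small
userspace model of the __sched_period()/sched_slice() arithmetic (a
sketch only, not kernel code). It assumes the scaled defaults of an
8-CPU system (sysctl_sched_latency = 24ms, sysctl_sched_min_granularity
= 3ms, so sched_nr_latency = 8) and the weights NICE_0_LOAD = 1024 and
WEIGHT_IDLEPRIO = 3; with those inputs it reproduces the 12ms, 3006ms
and 609ms figures and shows the effect of the min_t() cap:

/*
 * Back-of-the-envelope model of __sched_period()/sched_slice(),
 * ignoring cgroup hierarchy. All values in nanoseconds.
 */
#include <stdio.h>

#define LATENCY_NS      24000000ULL              /* sysctl_sched_latency */
#define MIN_GRAN_NS      3000000ULL              /* sysctl_sched_min_granularity */
#define NR_LATENCY      (LATENCY_NS / MIN_GRAN_NS)  /* sched_nr_latency = 8 */
#define NICE_0_LOAD     1024ULL                  /* weight of a nice=0 task */
#define WEIGHT_IDLEPRIO 3ULL                     /* weight of a SCHED_IDLE task */

/* __sched_period(): stretch the period once there are too many tasks */
static unsigned long long period(unsigned long long nr_running)
{
	if (nr_running > NR_LATENCY)
		return nr_running * MIN_GRAN_NS;
	return LATENCY_NS;
}

/* sched_slice() for one nice=0 entity on a rq with the given total weight */
static unsigned long long slice(unsigned long long nr, unsigned long long rq_weight)
{
	return period(nr) * NICE_0_LOAD / rq_weight;
}

int main(void)
{
	/* 2 nice=0 tasks alone: period = 24 ms, slice = 12 ms each */
	printf("2 tasks:    slice = %llu ms\n",
	       slice(2, 2 * NICE_0_LOAD) / 1000000);

	/* add 1000 sched_idle tasks: period = 3006 ms, slice ~= 609 ms */
	unsigned long long nr = 2 + 1000;
	unsigned long long w  = 2 * NICE_0_LOAD + 1000 * WEIGHT_IDLEPRIO;
	unsigned long long s  = slice(nr, w);
	printf("+1000 idle: period = %llu ms, slice = %llu ms\n",
	       period(nr) / 1000000, s / 1000000);

	/* the patch: cap ideal_runtime at sysctl_sched_latency */
	unsigned long long capped = s < LATENCY_NS ? s : LATENCY_NS;
	printf("capped:     ideal_runtime = %llu ms\n", capped / 1000000);
	return 0;
}

Running it prints slice = 12 ms for the two-task case, period = 3006 ms
and slice = 609 ms once the 1000 idle tasks are added, and ideal_runtime
= 24 ms with the cap applied, matching the changelog.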