linux-kernel - Re: [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs reweight


Message-ID: <20170929090434.GB962@e105550-lin.cambridge.arm.com>
Date:   Fri, 29 Sep 2017 10:04:34 +0100
From:   Morten Rasmussen <morten.rasmussen@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     mingo@...nel.org, linux-kernel@...r.kernel.org, tj@...nel.org,
        josef@...icpanda.com, torvalds@...ux-foundation.org,
        vincent.guittot@...aro.org, efault@....de, pjt@...gle.com,
        clm@...com, dietmar.eggemann@....com, bsegall@...gle.com,
        yuyang.du@...el.com
Subject: Re: [PATCH -v2 03/18] sched/fair: Cure calc_cfs_shares() vs
 reweight_entity()

On Fri, Sep 01, 2017 at 03:21:02PM +0200, Peter Zijlstra wrote:
> Vincent reported that when running in a cgroup, his root
> cfs_rq->avg.load_avg dropped to 0 on task idle.
> 
> This is because reweight_entity() will now immediately propagate the
> weight change of the group entity to its cfs_rq, and as it happens,
> our approximation (5) for calc_cfs_shares() results in 0 when the group
> is idle.
> 
> Avoid this by using the correct (3) as a lower bound on (5). This way
> the empty cgroup will slowly decay instead of instantly dropping to 0.
> 
> Reported-by: Vincent Guittot <vincent.guittot@...aro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> ---
>  kernel/sched/fair.c |    7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2703,11 +2703,10 @@ static long calc_cfs_shares(struct cfs_r
>  	tg_shares = READ_ONCE(tg->shares);
>  
>  	/*
> -	 * This really should be: cfs_rq->avg.load_avg, but instead we use
> -	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
> -	 * the shares for small weight interactive tasks.
> +	 * Because (5) drops to 0 when the cfs_rq is idle, we need to use (3)
> +	 * as a lower bound.
>  	 */
> -	load = scale_load_down(cfs_rq->load.weight);
> +	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);

We use cfs_rq->tg_load_avg_contrib (the filtered version of
cfs_rq->avg.load_avg) instead of cfs_rq->avg.load_avg further down, so I
think we should here too for consistency.

+	load = max(scale_load_down(cfs_rq->load.weight),
+		   cfs_rq->tg_load_avg_contrib);

With this change (5) almost becomes (3):

   ge->load.weight =

                 tg->weight * max(grq->load.weight, grq->avg.load_avg)
     ---------------------------------------------------------------------------
     tg->load_avg - grq->avg.load_avg + max(grq->load.weight, grq->avg.load_avg)

The difference is that we boost ge->load.weight if the grq has
runnable tasks with se->avg.load_avg < se->load.weight, i.e. tasks that
occasionally block. This means that the underestimate scenario I have in
my reply for patch #2 is no longer possible. AFAICT, we are now
guaranteed to over-estimate ge->load.weight. It is still quite sensitive
to periodic high priority tasks though.

tg->weight              = 1024
tg->load_avg            = 2560
\Sum grq->load.weight   = 2048

cpu                     0       1       \Sum
grq->avg.load_avg       1536    1024
grq->load.weight        1024    1024
load (max)              1536    1024
ge->load.weight (1)     512     512     1024 >= tg->weight
ge->load.weight (3)     614     410     1024 >= tg->weight
ge->load.weight (5)     512     410     922 < tg->weight
ge->load.weight (5*)    614     410     1024 >= tg->weight
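
For anyone who wants to check the numbers above, here is a minimal userspace sketch of the four variants. It is not kernel code: struct grq_sample and the shares_eq*() helpers are hypothetical names made up for this illustration, (5*) uses grq->avg.load_avg in the max() as in the formula above rather than tg_load_avg_contrib, and the division rounds to nearest only because that is what the table values appear to do.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-cpu group runqueue values in the table. */
struct grq_sample {
	long load_weight;	/* grq->load.weight */
	long load_avg;		/* grq->avg.load_avg */
};

/* Round-to-nearest division; an assumption made to match the table. */
long div_round(long num, long den)
{
	assert(den > 0);
	return (num + den / 2) / den;
}

long max_long(long a, long b)
{
	return a > b ? a : b;
}

/* (1): tg->weight * grq->load.weight / \Sum grq->load.weight */
long shares_eq1(long tg_weight, long sum_weight, const struct grq_sample *g)
{
	return div_round(tg_weight * g->load_weight, sum_weight);
}

/* (3): tg->weight * grq->avg.load_avg / tg->load_avg */
long shares_eq3(long tg_weight, long tg_load_avg, const struct grq_sample *g)
{
	return div_round(tg_weight * g->load_avg, tg_load_avg);
}

/* (5): grq->load.weight replaces the local load_avg contribution */
long shares_eq5(long tg_weight, long tg_load_avg, const struct grq_sample *g)
{
	long denom = tg_load_avg - g->load_avg + g->load_weight;

	return div_round(tg_weight * g->load_weight, denom);
}

/* (5*): as (5), but with max(grq->load.weight, grq->avg.load_avg) */
long shares_eq5_max(long tg_weight, long tg_load_avg,
		    const struct grq_sample *g)
{
	long load = max_long(g->load_weight, g->load_avg);
	long denom = tg_load_avg - g->load_avg + load;

	return div_round(tg_weight * load, denom);
}
```

Calling these with tg->weight = 1024, tg->load_avg = 2560, \Sum grq->load.weight = 2048 and the two per-cpu samples {1024, 1536} and {1024, 1024} reproduces the rows of the table.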