Message-ID: <20151012114723.GL3816@twins.programming.kicks-ass.net>
Date:	Mon, 12 Oct 2015 13:47:23 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Yuyang Du <yuyang.du@...el.com>
Cc:	Mike Galbraith <umgwanakikbuti@...il.com>,
	linux-kernel@...r.kernel.org
Subject: Re: 4.3 group scheduling regression

On Mon, Oct 12, 2015 at 10:12:31AM +0800, Yuyang Du wrote:
> On Mon, Oct 12, 2015 at 11:12:06AM +0200, Peter Zijlstra wrote:

> > So in the old code we had 'magic' to deal with the case where a cgroup
> > was consuming less than 1 cpu's worth of runtime. For example, a single
> > task running in the group.
> > 
> > In that scenario it might be possible that the group entity weight:
> > 
> > 	se->weight = (tg->shares * cfs_rq->weight) / tg->weight;
> > 
> > strongly deviates from tg->shares; you want the single task to reflect
> > the full group shares to the next level, due to the whole distributed
> > approximation stuff.
> 
> Yeah, I thought so.
>  
> > I see you've deleted all that code; see the former
> > __update_group_entity_contrib().
>  
> Probably not there, it actually was an icky way to adjust things.

Yeah, no argument there.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..b184da0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2370,7 +2370,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>  	 */
>  	tg_weight = atomic_long_read(&tg->load_avg);
>  	tg_weight -= cfs_rq->tg_load_avg_contrib;
> -	tg_weight += cfs_rq_load_avg(cfs_rq);
> +	tg_weight += cfs_rq->load.weight;
>  
>  	return tg_weight;
>  }
> @@ -2380,7 +2380,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
>  	long tg_weight, load, shares;
>  
>  	tg_weight = calc_tg_weight(tg, cfs_rq);
> -	load = cfs_rq_load_avg(cfs_rq);
> +	load = cfs_rq->load.weight;
>  
>  	shares = (tg->shares * load);
>  	if (tg_weight)

Aah, yes very much so. I completely overlooked that :-(

When calculating shares we very much want the current load, not the load
average.
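
To make the difference concrete, here is a rough userspace sketch of the
calculation in the hunk above. It is a toy, not the kernel code: the toy_*
names, the struct layout and the numbers are made up for illustration, and
the clamping to a minimum and to tg->shares is assumed rather than taken
from the quoted hunk.

```c
/* Toy stand-ins for illustration only; not the kernel's structures. */
#define TOY_MIN_SHARES 2L	/* assumed lower clamp */

struct toy_cfs_rq {
	long load_weight;		/* instantaneous weight of queued entities */
	long load_avg;			/* decayed average, lags behind load_weight */
	long tg_load_avg_contrib;	/* this rq's last published part of the group sum */
};

/*
 * shares = tg_shares * load / tg_weight, where tg_weight is the group-wide
 * sum with this rq's stale contribution swapped for 'load'.  The caller
 * picks which notion of load to feed in.
 */
static long toy_calc_shares(long tg_shares, long tg_load_avg,
			    const struct toy_cfs_rq *rq, long load)
{
	long tg_weight = tg_load_avg - rq->tg_load_avg_contrib + load;
	long shares = tg_shares * load;

	if (tg_weight)
		shares /= tg_weight;

	/* Assumed clamping: never below the minimum, never above tg_shares. */
	if (shares < TOY_MIN_SHARES)
		shares = TOY_MIN_SHARES;
	if (shares > tg_shares)
		shares = tg_shares;

	return shares;
}
```

Take a group with tg_shares = 1024 whose other CPUs already contribute 1024
to the group sum, and a task that has only just been queued here, so
load_weight = 1024 while load_avg and tg_load_avg_contrib are still 0.
Feeding load_avg into the sketch gives 2 (the clamp); feeding load_weight
gives 512, i.e. half the group shares for half the group's weight.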

Also, should we do the below? At this point se->on_rq is still 0 so
reweight_entity() will not update (dequeue/enqueue) the accounting, but
we'll have just accounted the 'old' load.weight.

Doing it this way around we'll first update the weight and then account
it, which seems more accurate.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700eb548315f..d2efef565aed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3009,8 +3009,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	enqueue_entity_load_avg(cfs_rq, se);
-	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
+	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);