linux-kernel - Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

Message-ID: <20171018124523.GA27508@e105550-lin.cambridge.arm.com>
Date:   Wed, 18 Oct 2017 13:45:25 +0100
From:   Morten Rasmussen <morten.rasmussen@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     mingo@...nel.org, linux-kernel@...r.kernel.org, tj@...nel.org,
        josef@...icpanda.com, torvalds@...ux-foundation.org,
        vincent.guittot@...aro.org, efault@....de, pjt@...gle.com,
        clm@...com, dietmar.eggemann@....com, bsegall@...gle.com,
        yuyang.du@...el.com
Subject: Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

On Mon, Oct 09, 2017 at 11:45:17AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 09, 2017 at 09:08:57AM +0100, Morten Rasmussen wrote:
> > > --- a/kernel/sched/debug.c
> > > +++ b/kernel/sched/debug.c
> > > @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
> > >  			cfs_rq->removed.load_avg);
> > >  	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
> > >  			cfs_rq->removed.util_avg);
> > > +	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
> > > +			cfs_rq->removed.runnable_sum);
> > >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > >  	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
> > >  			cfs_rq->tg_load_avg_contrib);
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
> > >  	se->avg.last_update_time = n_last_update_time;
> > >  }
> > >  
> > > -/* Take into account change of utilization of a child task group */
> > > +
> > > +/*
> > > + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> > > + * propagate its contribution. The key to this propagation is the invariant
> > > + * that for each group:
> > > + *
> > > + *   ge->avg == grq->avg						(1)
> > > + *
> > > + * _IFF_ we look at the pure running and runnable sums. Because they
> > > + * represent the very same entity, just at different points in the hierarchy.
> > > + *
> > > + *
> > > + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> > > + * simply copies the running sum over.
> > > + *
> > > + * However, update_tg_cfs_runnable() is more complex. So we have:
> > > + *
> > > + *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
> > > + *
> > > + * And since, like util, the runnable part should be directly transferable,
> > > + * the following would _appear_ to be the straight forward approach:
> > > + *
> > > + *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
> > 
> > Should it be grq->avg.runnable_avg instead of running_avg?
> 
> Yes very much so. Typing hard. Otherwise (3) would not follow from (2)
> either.
> 
> > cfs_rq->avg.load_avg has been defined previous (in patch 2 I think) to
> > be:
> > 
> > 	load_avg = \Sum se->avg.load_avg
> > 		 = \Sum se->load.weight * se->avg.runnable_avg
> > 
> > That sum will increase when ge is runnable regardless of whether it is
> > running or not. So, I think it has to be runnable_avg to make sense?
> 
> Ack.
> 
> > > + *
> > > + * And per (1) we have:
> > > + *
> > > + *   ge->avg.running_avg == grq->avg.running_avg
> > 
> > You just said further up that (1) only applies to running and runnable
> > sums? These are averages, so I think this is invalid use of (1). But
> > maybe that is part of your point about (4) being wrong?
> > 
> > I'm still trying to get my head around the remaining bits, but it sort
> > of depends if I understood the above bits correctly :)
> 
> So while true, the thing we're looking for is indeed runnable_avg.
> 
> > > + *
> > > + * Which gives:
> > > + *
> > > + *                      ge->load.weight * grq->avg.load_avg
> > > + *   ge->avg.load_avg = -----------------------------------		(4)
> > > + *                               grq->load.weight
> > > + *
> > > + * Except that is wrong!
> > > + *
> > > + * Because while for entities historical weight is not important and we
> > > + * really only care about our future and therefore can consider a pure
> > > + * runnable sum, runqueues can NOT do this.
> > > + *
> > > + * We specifically want runqueues to have a load_avg that includes
> > > + * historical weights. Those represent the blocked load, the load we expect
> > > + * to (shortly) return to us. This only works by keeping the weights as
> > > + * integral part of the sum. We therefore cannot decompose as per (3).
> > > + *
> > > + * OK, so what then?
> 
> And as the text above suggests, we cannot decompose because it contains
> the blocked weight, which is not included in grq->load.weight and thus
> things come apart.
> 
> > > + * Another way to look at things is:
> > > + *
> > > + *   grq->avg.load_avg = \Sum se->avg.load_avg
> > > + *
> > > + * Therefore, per (2):
> > > + *
> > > + *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> > > + *
> > > + * And the very thing we're propagating is a change in that sum (someone
> > > + * joined/left). So we can easily know the runnable change, which would be, per
> > > + * (2) the already tracked se->load_avg divided by the corresponding
> > > + * se->weight.
> > > + *
> > > + * Basically (4) but in differential form:
> > > + *
> > > + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
> > > + *								   (5)
> > > + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
> 
> And this all has runnable again, and so should make sense.

I'm afraid I don't quite get why (5) is correct. It might be related to
the issues Vincent already pointed out.

d(runnable_avg) is the runnable_avg series for the joining/leaving se
which is contributing to grq->avg.load_avg, but I don't see how you can
use that to compute the impact on ge->avg.load_avg.

	ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)

In (5) you have just substituted ge->avg.runnable_avg with
d(runnable_avg) in (2). However, the relationship between
ge->avg.runnable_avg and se->avg.runnable_avg is complicated. ge is
runnable whenever se is, but the reverse isn't necessarily true. Let's
say you have two always-runnable tasks on your grq and one of them leaves
(migrates away). In that case, ge->avg.runnable_avg is equal to
se->avg.runnable_avg (both always-runnable) which is d(runnable_avg), so
in (5) we end up with:

	ge->avg.load_avg =	ge->load.weight * ge->avg.runnable_avg
			      - ge->load.weight * se->avg.runnable_avg
			 = 0

But you still have one always-runnable task on the grq so clearly it
shouldn't be zero.
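
The arithmetic of that example can be sketched as a toy model. This is
only an illustration, not kernel code: the toy_* names are hypothetical,
runnable_avg is assumed to be normalized to [0, 1], and both ge and the
tasks are assumed to have weight 1024.

```c
/* Toy model of the example above; every name here is hypothetical. */
struct toy_entity {
	double weight;        /* stands in for load.weight */
	double runnable_avg;  /* assumed normalized to [0, 1] */
	double load_avg;      /* weight * runnable_avg, as in (2) */
};

/* Apply (5) for an se leaving the grq: subtract its runnable contribution. */
double toy_eq5_leave(struct toy_entity ge, struct toy_entity se)
{
	double d_runnable_avg = -(se.load_avg / se.weight);

	return ge.load_avg + ge.weight * d_runnable_avg;
}

/* What (2) gives for ge while the grq still holds an always-runnable task. */
double toy_eq2_expected(struct toy_entity ge)
{
	return ge.weight * 1.0;
}

/* Two always-runnable tasks on the grq, one of them migrates away. */
double toy_example_eq5(void)
{
	struct toy_entity ge = { 1024.0, 1.0, 1024.0 };
	struct toy_entity se = { 1024.0, 1.0, 1024.0 };

	return toy_eq5_leave(ge, se);
}
```

Under these assumptions toy_example_eq5() yields 0, while
toy_eq2_expected() yields 1024 for the same ge.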

IOW, AFAICT, it is not possible to decompose ge->avg.runnable_avg into
contributions from each individual se on the grq. At least not without
some additional assumptions.
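
A rough way to see this is a toy simulation with a geometric decay. The
sketch below is not the kernel's PELT code: TOY_Y is only an
approximation of the decay factor (chosen so that TOY_Y^32 is about
0.5), the toy_* names are hypothetical, and ge is simply assumed to be
runnable in any period where at least one se is. Two tasks that are each
runnable every other period end up with the same per-se average whether
they overlap or alternate, but ge's average differs between the two
cases.

```c
/* Toy decay factor, assumed so that TOY_Y^32 ~= 0.5; not the kernel's tables. */
#define TOY_Y 0.97857206

struct toy_result {
	double se_a;	/* runnable average of task A */
	double se_b;	/* runnable average of task B */
	double ge;	/* runnable average of the group entity */
};

/*
 * Task A is runnable in even periods. With phase_b == 0 task B is runnable
 * in the same periods as A; with phase_b == 1 it is runnable in the others.
 */
struct toy_result toy_simulate(int periods, int phase_b)
{
	struct toy_result r = { 0.0, 0.0, 0.0 };
	int t;

	for (t = 0; t < periods; t++) {
		int a = (t % 2 == 0);
		int b = ((t + phase_b) % 2 == 0);
		int g = a || b;	/* ge is runnable whenever any se is */

		r.se_a = r.se_a * TOY_Y + (1.0 - TOY_Y) * a;
		r.se_b = r.se_b * TOY_Y + (1.0 - TOY_Y) * b;
		r.ge   = r.ge   * TOY_Y + (1.0 - TOY_Y) * g;
	}

	return r;
}
```

With toy_simulate(2000, 0) and toy_simulate(2000, 1), se_a and se_b sit
near 0.5 in both runs, while ge is near 0.5 in the first and near 1.0 in
the second. So the per-se averages alone do not determine ge's.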

What am I missing?

Morten