linux-kernel - Re: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170929113500.kryvylelokyohssz@hirez.programming.kicks-ass.net>
Date:   Fri, 29 Sep 2017 13:35:00 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Morten Rasmussen <morten.rasmussen@....com>
Cc:     mingo@...nel.org, linux-kernel@...r.kernel.org, tj@...nel.org,
        josef@...icpanda.com, torvalds@...ux-foundation.org,
        vincent.guittot@...aro.org, efault@....de, pjt@...gle.com,
        clm@...com, dietmar.eggemann@....com, bsegall@...gle.com,
        yuyang.du@...el.com
Subject: Re: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares()

On Thu, Sep 28, 2017 at 11:03:03AM +0100, Morten Rasmussen wrote:

> > +/*
> > + * All this does is approximate the hierarchical proportion which includes that
> > + * global sum we all love to hate.
> > + *
> > + * That is, the weight of a group entity, is the proportional share of the
> > + * group weight based on the group runqueue weights. That is:
> > + *
> > + *                     tg->weight * grq->load.weight
> > + *   ge->load.weight = -----------------------------               (1)
> > + *			  \Sum grq->load.weight
> > + *
> > + * Now, because computing that sum is prohibitively expensive to compute (been
> > + * there, done that) we approximate it with this average stuff. The average
> > + * moves slower and therefore the approximation is cheaper and more stable.
> > + *
> > + * So instead of the above, we substitute:
> > + *
> > + *   grq->load.weight -> grq->avg.load_avg                         (2)
> > + *
> > + * which yields the following:
> > + *
> > + *                     tg->weight * grq->avg.load_avg
> > + *   ge->load.weight = ------------------------------              (3)
> > + *				tg->load_avg
> > + *
> > + * Where: tg->load_avg ~= \Sum grq->avg.load_avg
> > + *
> > + * That is shares_avg, and it is right (given the approximation (2)).
> > + *
> > + * The problem with it is that because the average is slow -- it was designed
> > + * to be exactly that of course -- this leads to transients in boundary
> > + * conditions. In specific, the case where the group was idle and we start the
> > + * one task. It takes time for our CPU's grq->avg.load_avg to build up,
> > + * yielding bad latency etc..
> > + *
> > + * Now, in that special case (1) reduces to:
> > + *
> > + *                     tg->weight * grq->load.weight
> > + *   ge->load.weight = ----------------------------- = tg>weight   (4)
> > + *			    grp->load.weight
> 
> Should it be "grq->load.weight" in the denominator of (4)?
> And "tg->weight" at the end?

Yes, otherwise its all doesn't really make sense :-) Typing is hard it
seems.

> > + *
> > + * That is, the sum collapses because all other CPUs are idle; the UP scenario.
> 
> Shouldn't (3) collapse in the same way too in this special case?

That's more difficult to see and (1) is the canonical form.

> In theory it should reduce to:
> 
>                      tg->weight * grq->avg.load_avg
>    ge->load.weight = ------------------------------
> 			grq->avg.load_avg

Yes, agreed.

> But I can see many reasons why it won't happen in practice if things
> aren't perfectly up-to-date. If tg->load_avg and grq->avg.load_avg in
> (3) aren't in sync, or there are stale contributions to tg->load_avg
> from other cpus then (3) can return anything between 0 and tg->weight.

Just so.

> > + *
> > + * So what we do is modify our approximation (3) to approach (4) in the (near)
> > + * UP case, like:
> > + *
> > + *   ge->load.weight =
> > + *
> > + *              tg->weight * grq->load.weight
> > + *     ---------------------------------------------------         (5)
> > + *     tg->load_avg - grq->avg.load_avg + grq->load.weight
> > + *
> > + *
> > + * And that is shares_weight and is icky. In the (near) UP case it approaches
> > + * (4) while in the normal case it approaches (3). It consistently
> > + * overestimates the ge->load.weight and therefore:
> > + *
> > + *   \Sum ge->load.weight >= tg->weight
> > + *
> > + * hence icky!
> 
> IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked
> tasks, you can end up with underestimating the ge->load.weight for some
> of the group entities lead to \Sum ge->load.weight < tg->weight.

Ah yes, you're right. However, if you look at the end of the series we
actually end up with using:

	max(grq->load.weight, grq->avg.load_avg)

Which I suppose makes it true again.