Date:   Tue, 11 Apr 2017 09:52:21 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     mingo@...nel.org, linux-kernel@...r.kernel.org,
        dietmar.eggemann@....com, Morten.Rasmussen@....com,
        yuyang.du@...el.com, pjt@...gle.com, bsegall@...gle.com
Subject: Re: [PATCH v2] sched/fair: update scale invariance of PELT

On Monday 10 Apr 2017 at 19:38:02 (+0200), Peter Zijlstra wrote:
> 
> Thanks for the rebase.
> 
> On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:
> 
> Ok, so let me try and paraphrase what this patch does.
> 
> So consider a task that runs 16 out of our 32ms window:
> 
>    running   idle
>   |---------|---------|
> 
> 
> You're saying that when we scale running with the frequency, suppose we
> were at 50% freq, we'll end up with:
> 
>    run  idle
>   |----|---------|
> 
> 
> Which is obviously a shorter total than before; so what you do is add
> back the lost idle time like:
> 
>    run  lost idle
>   |----|----|---------|
> 
> 
> to arrive at the same total time. Which seems to make sense.

Yes
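
To put numbers on your 16ms/32ms example, here is a minimal userspace sketch
(not the kernel code; cap_scale() just mirrors the kernel's
(delta * cap) >> SCHED_CAPACITY_SHIFT, and the microsecond values and the
512/1024 capacity are illustrative):

#include <stdio.h>
#include <stdint.h>

/*
 * Minimal userspace sketch (not kernel code): scale the running delta by
 * the current capacity and remember the difference as stolen idle time,
 * to be added back when the task goes to sleep.
 */
#define SCHED_CAPACITY_SHIFT	10

static uint64_t cap_scale(uint64_t delta, unsigned long cap)
{
	return (delta * cap) >> SCHED_CAPACITY_SHIFT;
}

int main(void)
{
	uint64_t running = 16000;	/* 16ms of wall-clock running time */
	uint64_t idle = 16000;		/* 16ms of wall-clock idle time */
	unsigned long freq_cap = 512;	/* running at 50% of max frequency */
	uint64_t stolen_idle_time = 0;

	/* scale the running delta to the effective amount of computation */
	uint64_t scaled = cap_scale(running, freq_cap);
	stolen_idle_time += running - scaled;

	/* when the task goes back to sleep, add the stolen time to idle */
	idle += stolen_idle_time;
	stolen_idle_time = 0;

	printf("running accounted: %llu us, idle accounted: %llu us, total: %llu us\n",
	       (unsigned long long)scaled, (unsigned long long)idle,
	       (unsigned long long)(scaled + idle));
	return 0;
}

The 16ms running delta scales to 8ms, the 8ms difference is stolen, and
adding it back to the idle delta keeps the accounted window at 32ms.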

> 
> Now I have vague memories of Morten having issues with your previous
> patches, so I'll wait for him to chime in as well.

IIRC, Morten's concerns were about the lost idle time, which was not taken into account in the previous version.

> 
> 
> On to the implementation:
> 
> >  /*
> > + * Scale the time to reflect the effective amount of computation done during
> > + * this delta time.
> > + */
> > +static __always_inline u64
> > +scale_time(u64 delta, int cpu, struct sched_avg *sa,
> > +		unsigned long weight, int running)
> > +{
> > +	if (running) {
> > +		sa->stolen_idle_time += delta;
> > +		/*
> > +		 * scale the elapsed time to reflect the real amount of
> > +		 * computation
> > +		 */
> > +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> > +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> > +
> > +		/*
> > +		 * Track the amount of stolen idle time due to running at
> > +		 * lower capacity
> > +		 */
> > +		sa->stolen_idle_time -= delta;
> 
> OK so far so good, this tracks, in stolen_idle_time, the 'lost' bit from
> above.
> 
> > +	} else if (!weight) {
> > +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
> 
> But here I'm completely lost. WTF just happened ;-)
> 
> Firstly, I think we want a comment on why we care about the !weight
> case. Why isn't !running sufficient?

We track the time when the task is "really" idle but not the time the task spends
waiting to run on the CPU. Running is used to detect when the task is actually
running and how much idle time has been lost, while weight is used to detect when the
task has gone back to sleep and when we have to account for the lost idle time.
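
As a plain restatement (a userspace sketch, not the actual scale_time() code;
the enum and function names are mine), the two flags split the accounting into
three cases:

#include <stdio.h>

/* Userspace restatement of the cases above; names are illustrative only. */
enum pelt_case {
	RUNNING_STEAL,		/* running: accumulate stolen idle time */
	RUNNABLE_WAIT,		/* !running but weight: waiting on the rq, do nothing */
	SLEEPING_FOLD_BACK,	/* !weight: task sleeps, fold the stolen time back */
};

static enum pelt_case classify(int running, unsigned long weight)
{
	if (running)
		return RUNNING_STEAL;
	if (weight)
		return RUNNABLE_WAIT;
	return SLEEPING_FOLD_BACK;
}

int main(void)
{
	printf("running=1 weight=1 -> %d\n", classify(1, 1));
	printf("running=0 weight=1 -> %d\n", classify(0, 1));
	printf("running=0 weight=0 -> %d\n", classify(0, 0));
	return 0;
}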

>
> 
> Secondly, what's up with the util_sum < LOAD_AVG_MAX * 1000 thing?

The lost idle time only makes sense if the task can also be "idle" when running at max
capacity. Once util_sum reaches LOAD_AVG_MAX*SCHED_CAPACITY_SCALE, all tasks look the
same because we can't tell the difference between a task running for 400ms and one
running for 400sec: these are "always running" tasks even at max capacity. In that
case there is no lost idle time because they never stop running, so tracking and
adding back idle time lost to running at lower capacity no longer makes sense and we
discard it. However, an always-running task can have a util_sum below the max value
because of rounding (util_avg varies in the range [1006..1023]), so I use
LOAD_AVG_MAX*1000 as the threshold instead of LOAD_AVG_MAX*1024.
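
For illustration, a small userspace sketch of that threshold check (not kernel
code; LOAD_AVG_MAX here is the ~47742 value from that kernel and 1006 is the
observed lower bound mentioned above):

#include <stdio.h>

/*
 * Userspace sketch of the threshold (not kernel code). With
 * util_avg = util_sum / LOAD_AVG_MAX, a saturated "always running" task may
 * settle around util_avg ~1006 instead of 1024 because of rounding, so its
 * util_sum still exceeds LOAD_AVG_MAX * 1000 and the stolen idle time is
 * discarded rather than added back.
 */
#define LOAD_AVG_MAX	47742ULL	/* max accumulated PELT sum in that kernel */

int main(void)
{
	unsigned long long util_sum = 1006 * LOAD_AVG_MAX;	/* saturated task */

	if (util_sum < LOAD_AVG_MAX * 1000)
		printf("task can still be idle: add back the stolen idle time\n");
	else
		printf("always-running task: discard the stolen idle time\n");
	return 0;
}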

> 
> Is that to deal with cpu_capacity?
> 
> 
> > +			/*
> > +			 * Add the idle time stolen by running at lower compute
> > +			 * capacity
> > +			 */
> > +			delta += sa->stolen_idle_time;
> > +		}
> > +		sa->stolen_idle_time = 0;
> > +	}
> > +
> > +	return delta;
> > +}
> 
> 
> Thirdly, I'm thinking this isn't quite right. Imagine a task that's
> running across a decay window, then we'll only add back the stolen_idle
> time in the next window, even though it should've been in this one,
> right?

I don't think so, because PELT should not see more decay windows at half capacity
than at max capacity.
In the example below we cross the absolute-time decay window when running at half
capacity, but once we scale the running delta we no longer cross it, and the PELT
update happens in the same way in both cases:

decay window |-------|-------|-------|--
max capacity ---xxxx------------xxxx----
update         *   *           *   *

half capacity---xxxxxxxx--------xxxxxxxx
accounted    ---xxxx____--------xxxx____
update         *   *           *   *

x running
- sleep
_ lost idle time
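
To put numbers on the picture (again a userspace sketch, not the kernel code;
the 1024us period and the example deltas are illustrative):

#include <stdio.h>

/*
 * Userspace sketch (not kernel code): the number of 1024us decay periods is
 * derived from the scaled delta, so running twice as long in wall-clock time
 * at half capacity does not cross more decay windows than doing the same
 * work at full capacity.
 */
int main(void)
{
	unsigned long long wall_delta = 2048;			/* 2ms running at half capacity */
	unsigned long long scaled = (wall_delta * 512) >> 10;	/* 1ms of effective computation */

	printf("periods crossed from wall-clock delta: %llu\n", wall_delta / 1024);
	printf("periods crossed from scaled delta:     %llu\n", scaled / 1024);
	return 0;
}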

