linux-kernel - Re: [PATCH v5 00/10] track CPU utilization


Date:   Mon, 4 Jun 2018 18:13:40 +0100
From:   Quentin Perret <quentin.perret@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Vincent Guittot <vincent.guittot@...aro.org>, mingo@...nel.org,
        linux-kernel@...r.kernel.org, rjw@...ysocki.net,
        juri.lelli@...hat.com, dietmar.eggemann@....com,
        Morten.Rasmussen@....com, viresh.kumar@...aro.org,
        valentin.schneider@....com
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Monday 04 Jun 2018 at 18:50:47 (+0200), Peter Zijlstra wrote:
> On Fri, May 25, 2018 at 03:12:21PM +0200, Vincent Guittot wrote:
> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency
> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
> > reflect anymore the utilization of cfs tasks but only the remaining part that
> > is not used by rt tasks. We should monitor the stolen utilization and take
> > it into account when selecting OPP. This patchset doesn't change the OPP
> > selection policy for RT tasks but only for CFS tasks
> 
> So the problem is that when RT/DL/stop/IRQ happens and preempts CFS
> tasks, time continues and the CFS load tracking will see !running and
> decay things.
> 
> Then, when we get back to CFS, we'll have lower load/util than we
> expected.
> 
> In particular, your focus is on OPP selection, and where we would have
> say: u=1 (always running task), after being preempted by our RT task for
> a while, it will now have u=.5. With the effect that when the RT task
> goes to sleep we'll drop our OPP to .5 max -- which is 'wrong', right?
> 
> Your solution is to track RT/DL/stop/IRQ with the identical PELT average
> as we track cfs util. Such that we can then add the various averages to
> reconstruct the actual utilisation signal.
> 
> This should work for the case of the utilization signal on UP. On SMP,
> however, PELT migrates the signal around, but we don't do that to the
> per-rq signals we have for RT/DL/stop/IRQ.
> 
> There is also the 'complaint' that this ends up with 2 util signals for
> DL, complicating things.
> 
> 
> So this patch-set tracks the !cfs occupation using the same function,
> which is all good. But what, if instead of using that to compensate the
> OPP selection, we employ that to renormalize the util signal?
> 
> If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
> then I think your initial problem goes away. Because while the RT task
> will push the util to .5, it will at the same time push the CPU capacity
> to .5, and renormalized that gives 1.
> 
>   NOTE: the renorm would then become something like:
>         scale_cpu = arch_scale_cpu_capacity() / rt_frac();

Isn't that equivalent? I mean, you can either remove RT/DL/stop/IRQ from the
CPU capacity and compare the CFS util_avg against that, or add
RT/DL/stop/IRQ to the CFS util_avg and compare it to arch_scale_cpu_capacity().
Both should be interchangeable, no? By adding the RT/DL/IRQ PELT signals
to the CFS util_avg, Vincent is proposing to go with the latter, I think.
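
To make the comparison concrete, here is a minimal userspace sketch of the
two views, not kernel code. The function names and the plain unsigned long
parameters are assumptions made for illustration; other_util stands for the
summed RT/DL/stop/IRQ utilization and cap_orig for what
arch_scale_cpu_capacity() would return. It only covers the "does CFS fit"
comparison, not the division-based renormalization quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * View 1 (hypothetical helper): remove the non-CFS utilization from the
 * CPU capacity and compare the CFS util against what is left.
 */
bool fits_reduced_capacity(unsigned long cfs_util, unsigned long other_util,
			   unsigned long cap_orig)
{
	/* Assumption: non-CFS utilization never exceeds the original capacity. */
	assert(other_util <= cap_orig);
	return cfs_util <= cap_orig - other_util;
}

/*
 * View 2 (hypothetical helper): add the non-CFS utilization to the CFS
 * util and compare the sum against the original capacity.
 */
bool fits_summed_util(unsigned long cfs_util, unsigned long other_util,
		      unsigned long cap_orig)
{
	return cfs_util + other_util <= cap_orig;
}
```

For any inputs with other_util <= cap_orig the two helpers return the same
answer, since one inequality is just the other with other_util moved across.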

But aren't the signals we currently use to account for RT/DL/stop/IRQ in
cpu_capacity good enough for that? Can't we just add the difference between
capacity_orig_of() and capacity_of() to the CFS util and do OPP selection with
that (for !nr_rt_running)? Maybe add a min with the DL running_bw to be on
the safe side ...?
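
Roughly, as a userspace sketch rather than a patch: the helper name and its
parameters below are made up for illustration, with capacity_orig and
capacity standing for the values capacity_orig_of() and capacity_of() would
return for the CPU. The clamp to capacity_orig is my assumption, and the DL
running_bw safeguard is left out since I have not settled on its form. The
caller is assumed to use this only when no RT task is runnable.

```c
#include <assert.h>

/*
 * Hypothetical helper: estimate the utilization to feed OPP selection by
 * adding the capacity "stolen" by RT/DL/stop/IRQ back onto the CFS util.
 */
unsigned long opp_util_estimate(unsigned long cfs_util,
				unsigned long capacity_orig,
				unsigned long capacity)
{
	unsigned long stolen, util;

	/* Assumption: the remaining capacity never exceeds the original one. */
	assert(capacity <= capacity_orig);

	stolen = capacity_orig - capacity;
	util = cfs_util + stolen;

	/* Assumed clamp: never request more than the CPU can deliver. */
	return util < capacity_orig ? util : capacity_orig;
}
```

With capacity_orig = 1024 and capacity = 512, an always-running CFS task
that PELT reports at 512 comes out as 1024 again, which is the OPP we
wanted in the first place.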

> 
> 
> On IRC I mentioned stopping the CFS clock when preempted, and while that
> would result in fixed numbers, Vincent was right in pointing out the
> numbers will be difficult to interpret, since the meaning will be purely
> CPU local and I'm not sure you can actually fix it again with
> normalization.
> 
> Imagine, running a .3 RT task, that would push the (always running) CFS
> down to .7, but because we discard all !cfs time, it actually has 1. If
> we try and normalize that we'll end up with ~1.43, which is of course
> completely broken.
> 
> 
> _However_, all that happens for util, also happens for load. So the above
> scenario will also make the CPU appear less loaded than it actually is.
> 
> Now, we actually try and compensate for that by decreasing the capacity
> of the CPU. But because the existing rt_avg and PELT signals are so
> out-of-tune, this is likely to be less than ideal. With that fixed
> however, the best this appears to do is, as per the above, preserve the
> actual load. But what we really wanted is to actually inflate the load,
> such that someone will take load from us -- we're doing less actual work
> after all.
> 
> Possibly, we can do something like:
> 
> 	scale_cpu_capacity / (rt_frac^2)
> 
> for load, then we inflate the load and could maybe get rid of all this
> capacity_of() sprinkling, but that needs more thinking.
> 
> 
> But I really feel we need to consider both util and load, as this issue
> affects both.