Message-ID: <Zv5UDufqpoXvSocy@google.com>
Date: Thu, 3 Oct 2024 08:21:34 +0000
From: Quentin Perret <qperret@...gle.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
	mgorman@...e.de, vschneid@...hat.com, lukasz.luba@....com,
	rafael.j.wysocki@...el.com, linux-kernel@...r.kernel.org,
	qyousef@...alina.io, hongyan.xia2@....com
Subject: Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized

On Thursday 03 Oct 2024 at 08:27:00 (+0200), Vincent Guittot wrote:
> On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@...gle.com> wrote:
> > And again, checking that a task fits is broken to start with if we don't
> > know how big the task is. When we have reasons to believe that the util
> > values are no longer correct (and the absence of idle time is a very
> > good reason for that) we just need to give up on them. The fact that we
> > have to resort to using out-of-date data to sort of make that work is
> > just another proof that this is not a good idea in the general case.
> 
> That's where I disagree, this is not an out-of-date value, this is the
> last correct one before sharing the cpu

This value is arbitrarily old, so of course it is out of date. This only
sort of works for tasks that don't change their behaviour. That's true
for some use-cases, yes, but absolutely not in the general case. How
can you know that the last correct value before sharing the CPU is still
valid minutes later? The fact that the system started to be
overcommitted is a good indication that something has changed, so we
really can't tell. Also, how is any of this going to work for newly
created tasks while we're overcommitted for example?

> > > the commit that I mentioned above covers those cases and the task will
> > > not incorrectly fit to another smaller CPU because its util_est is
> > > preserved during the overutilized phase
> >
> > There are other reasons why a task may look like it fits, e.g. two tasks
> > coscheduled on a big CPU get 50% util each, then we migrate one away, the
> 
> 50% of what ?

50% of SCHED_CAPACITY_SCALE (the above sentence mentions a 'big' CPU, and
for simplicity I assumed no 'pressure' of any kind).
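
To make the 50% figure concrete, here is a minimal sketch of the
coscheduling example. It is a simplified PELT-like model, not the kernel
implementation: the 1 ms slices, the strict alternation between the two
tasks, and the helper names step() and coscheduled_util() are all
assumptions made for illustration. Only the 32 ms half-life and
SCHED_CAPACITY_SCALE = 1024 come from the kernel.

```python
# Simplified PELT-like utilization signal (illustrative, not kernel code).
SCHED_CAPACITY_SCALE = 1024
HALF_LIFE_MS = 32                    # PELT half-life: y^32 = 0.5
Y = 0.5 ** (1.0 / HALF_LIFE_MS)      # per-millisecond decay factor


def step(util, running):
    """Advance one utilization signal by 1 ms (hypothetical helper)."""
    contrib = SCHED_CAPACITY_SCALE * (1.0 - Y) if running else 0.0
    return util * Y + contrib


def coscheduled_util(ms):
    """Two always-runnable tasks alternate on one big CPU in 1 ms slices."""
    util_a = util_b = 0.0
    for t in range(ms):
        a_runs = (t % 2 == 0)        # assumed strict alternation
        util_a = step(util_a, a_runs)
        util_b = step(util_b, not a_runs)
    return util_a, util_b


util_a, util_b = coscheduled_util(1000)
# Each task settles close to half of SCHED_CAPACITY_SCALE.
print(round(util_a), round(util_b))
```

In this model each task settles near 512 even though either of them
might consume the whole CPU if it ran alone. Once one task migrates
away, the remaining ~512 makes the CPU look half empty, and nothing in
the signal says whether it really is.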

> not the cpu capacity. I think you miss one piece of the
> recent pelt behavior here

That could very well be the case, which piece are you thinking of?

> I fully agree that when the system is
> overcommitted the util-based task placement is not correct but I also
> think that feec() can't find a cpu in such case

But why are we even entering feec() then? Isn't this just looking for
trouble really? As per the example above, task migrations can cause util
'gaps' on the source CPU which may make it appear like a good candidate
from an energy standpoint, but it's all bogus really. And let's not even
talk about how wrong the EM is going to be when simulating a potential task
migration in the overcommitted case.

> > CPU looks half empty. Is it half empty? We've got no way to tell until
> 
> The same here, it's not thanks to util_est

And again, an out-of-date util est value is not helpful in the general
case. It helps certain use-cases, sure, but please let's not promote it
to a load-bearing construct on top of which we build our entire
scheduling strategy :-)

> > we see idle time. The current util_avg and old util_est value are just
> > not helpful, they're both bad signals and we should just discard them.
> >
> > So again I do feel like the best way forward would be to change the
> > nature of the OU threshold to actually ask cpuidle 'when was the last
> > time there was idle time?' (or possibly cache that in the idle task
> > directly). And then based on that we can decide whether we want to enter
> > feec() and do util-based decision, or to kick the push-pull mechanism in
> > your other patches, things like that. That would solve/avoid the problem
> > I mentioned in the previous paragraph and make the OU detection more
> > robust. We could also consider using different thresholds in different
> > places to re-enable load-balancing earlier, and give up on feec() a bit
> > later to avoid messing the entire task placement when we're only
> > transiently OU because of misfit. But eventually, we really need to just
> > give up on util values altogether when we're really overcommitted, it's
> > really an invariant we need to keep.
> 
> For now, I will increase the OU threshold to cpu capacity to reduce
> the false overutilized state because of misfit tasks which is what I
> really care about.

Cool, and FWIW I am supportive of making this whole part of the code
better -- a transient OU state due to misfit does make a mess of things
and we should indeed be able to do better.

> The redesign of OU will come in a different series
> as this implies more rework.

Ack, this can be made orthogonal to this work I think.
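
For reference, the idle-time-based OU check suggested earlier in this
thread could look roughly like the sketch below. This is a toy model
only: the Cpu type, the last_idle_ms field, the 50 ms window, and the
"any CPU" policy are hypothetical choices made for illustration, and
none of them correspond to existing kernel or cpuidle interfaces.

```python
# Hypothetical model of an idle-time-based overutilized (OU) check.
from dataclasses import dataclass


@dataclass
class Cpu:
    last_idle_ms: float  # assumed: timestamp of this CPU's last idle time


def overcommitted(cpus, now_ms, window_ms=50.0):
    """True if any CPU has seen no idle time within the window (assumed policy)."""
    return any(now_ms - c.last_idle_ms > window_ms for c in cpus)


def pick_placement_path(cpus, now_ms, window_ms=50.0):
    """Decide whether util-based placement is worth attempting."""
    if overcommitted(cpus, now_ms, window_ms):
        # No recent idle time: util values cannot be trusted, so give up
        # on them and use a non-util-based path instead.
        return "fallback"
    # Recent idle time everywhere: util-based decisions are meaningful.
    return "feec"


cpus = [Cpu(last_idle_ms=990.0), Cpu(last_idle_ms=900.0)]
print(pick_placement_path(cpus, now_ms=1000.0))
```

The point of the sketch is that the decision depends only on when idle
time was last observed, never on util_avg or util_est, which is what
keeps it meaningful when those signals are stale.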

> IIUC your point, we are more interested
> by the prev cpu than the current one

Hmm, not sure I understand that part. What do you mean?

Thanks,
Quentin
