linux-kernel - Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <71e7e154-584a-4856-b906-ba92c636b17f@arm.com>
Date: Thu, 3 Oct 2024 09:15:31 +0100
From: Lukasz Luba <lukasz.luba@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, rafael.j.wysocki@...el.com,
 linux-kernel@...r.kernel.org, qyousef@...alina.io, hongyan.xia2@....com,
 Quentin Perret <qperret@...gle.com>
Subject: Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized

Hi Vincent,

On 10/3/24 07:27, Vincent Guittot wrote:
> On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@...gle.com> wrote:
>>
>> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
>>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
>>> utilization"), the util_est remains set the value before having to
>>> share the cpu with other tasks which means that the util_est remains
>>> correct even if its util_avg decrease because of sharing the cpu with
>>> other task. This has been done to cover the cases that you mention
>>> above whereboth util_avg and util_est where decreasing when tasks
>>> starts to  share  the CPU bandwidth with others
>>
>> I don't think I agree about the correctness of that util_est value at
>> all. The above patch only makes it arbitrarily out of date in the truly
>> overcommitted case. All the util-based heuristic we have in the
>> scheduler are based around the assumption that the close future will
>> look like the recent past, so using an arbitrarily old util-est is still
>> incorrect. I can understand how this may work OK in RT-app or other
> 
> This fixes a real use case on android device
> 
>> use-cases with perfectly periodic tasks for their entire lifetime and
>> such, but this doesn't work at all in the general case.
>>
>>> And feec() will return -1 for that case because util_est remains high
>>
>> And again, checking that a task fits is broken to start with if we don't
>> know how big the task is. When we have reasons to believe that the util
>> values are no longer correct (and the absence of idle time is a very
>> good reason for that) we just need to give up on them. The fact that we
>> have to resort to using out-of-date data to sort of make that work is
>> just another proof that this is not a good idea in the general case.
> 
> That's where I disagree, this is not an out-of-date value, this is the
> last correct one before sharing the cpu
> 
>>
>>> the commit that I mentioned above covers those cases and the task will
>>> not incorrectly fit to another smaller CPU because its util_est is
>>> preserved during the overutilized phase
>>
>> There are other reasons why a task may look like it fits, e.g. two tasks
>> coscheduled on a big CPU get 50% util each, then we migrate one away, the
> 
> 50% of what ? not the cpu capacity. I think you miss one piece of the
> recent pelt behavior here. I fullygree that when the system os
> overcommitted the util base task placement is not correct but I also
> think that feec() can't find a cpu in such case
> 
>> CPU looks half empty. Is it half empty? We've got no way to tell until
> 
> The same here, it's not thanks to util_est
> 
>> we see idle time. The current util_avg and old util_est value are just
>> not helpful, they're both bad signals and we should just discard them.
>>
>> So again I do feel like the best way forward would be to change the
>> nature of the OU threshold to actually ask cpuidle 'when was the last
>> time there was idle time?' (or possibly cache that in the idle task
>> directly). And then based on that we can decide whether we want to enter
>> feec() and do util-based decision, or to kick the push-pull mechanism in
>> your other patches, things like that. That would solve/avoid the problem
>> I mentioned in the previous paragraph and make the OU detection more
>> robust. We could also consider using different thresholds in different
>> places to re-enable load-balancing earlier, and give up on feec() a bit
>> later to avoid messing the entire task placement when we're only
>> transiently OU because of misfit. But eventually, we really need to just
>> give up on util values altogether when we're really overcommitted, it's
>> really an invariant we need to keep.
> 
> For now, I will increase the OU threshold to cpu capacity to reduce
> the false overutilized state because of misfit tasks which is what I
> really care about. The redesign of OU will come in a different series
> as this implies more rework. IIUC your point, we are more interested
> by the prev cpu than the current one
> 

What do you mean by that?
Is it due to OU in e.g. Little cluster?
Is it amplified by the uclamp_max usage?

You're re-writing heavily the EAS+EM and I would like to understand
your motivation.

BTW, do you know that if you or anyone wants to improve the EAS/EM
should be able to provide the power numbers?

W/o the power numbers the discussion is moot. Many times SW engineers
have wrong assumptions about HW, therefore we have to test and
measure. There are confidential power saving techniques in HW
that can be missed and some ugly workaround created in SW for issues
which don't exist.

That's why we have to discuss the power numbers.

This _subject_ is not different. If EAS is going to help
even in OU state - we need the numbers.

Regards,
Lukasz