linux-kernel - Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <80da988f-899e-4b93-a648-ffd0680d4000@arm.com>
Date: Tue, 7 May 2024 16:19:20 +0100
From: Christian Loehle <christian.loehle@....com>
To: Qais Yousef <qyousef@...alina.io>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, linux-kernel@...r.kernel.org,
 peterz@...radead.org, juri.lelli@...hat.com, mingo@...hat.com,
 dietmar.eggemann@....com, vschneid@...hat.com, vincent.guittot@...aro.org,
 Johannes.Thumshirn@....com, adrian.hunter@...el.com, ulf.hansson@...aro.org,
 andres@...razel.de, asml.silence@...il.com, linux-pm@...r.kernel.org,
 linux-block@...r.kernel.org, io-uring@...r.kernel.org
Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost

On 29/04/2024 12:18, Qais Yousef wrote:
> On 04/19/24 14:42, Christian Loehle wrote:
> 
>>> I think the major thing we need to be careful about is the behavior when the
>>> task is sleeping. I think the boosting will be removed when the task is
>>> dequeued and I can bet there will be systems out there where the BLOCK softirq
>>> being boosted when the task is sleeping will matter.
>>
>> Currently I see this mainly protected by the sugov rate_limit_us.
>> With the enqueue's being the dominating cpufreq updates it's not really an
>> issue, the boost is expected to survive the sleep duration, during which it
>> wouldn't be active.
>> I did experiment with some sort of 'stickiness' of the boost to the rq, but
>> it is somewhat of a pain to deal with if we want to remove it once enqueued
>> on a different rq. A sugov 1ms timer is much simpler of course.
>> Currently it's not necessary IMO, but for the sake of being future-proof in
>> terms of more frequent freq updates I might include it in v2.
> 
> Making sure things work with purpose would be really great. This implicit
> dependency is not great IMHO and make both testing and reasoning about why
> things are good or bad harder when analysing real workloads. Especially by non
> kernel developers.

Agreed.
Even without your proposed changes [1] relying on sugov rate_limit_us is
unfortunate.
There is a problem with an arbitrarily low rate_limit_us more generally, not
just because we kind of rely on the CPU being boosted right before the task is
actually enqueued (for the interrupt/softirq part of it), but also because of
the latency from requested frequency improvement to actually running on that
frequency. If the task is 90% done by the time it sees the improvement and
the frequency will be updated (back to a lower one) before the next enqueue,
then that's hardly worth the effort.
Currently this is covered by rate_limit_us probabillistically and that seems
to be good enough in practice, but it's not very pleasing (and also EAS can't
take it into consideration).
That's not just exclusive for iowait wakeup tasks of course, but in theory any
that is off the rq frequently (and still requests a higher frequency than it can
realistically build up through util_avg like through uclamp_min).

>>>
>>> FWIW I do have an implementation for per-task iowait boost where I went a step
>>> further and converted intel_pstate too and like Christian didn't notice
>>> a regression. But I am not sure (rather don't think) I triggered this use case.
>>> I can't tell when the systems truly have per-cpu cpufreq control or just appear
>>> so and they are actually shared but not visible at linux level.
>>
>> Please do share your intel_pstate proposal!
> 
> This is what I had. I haven't been working on this for the past few months, but
> I remember tried several tests on different machines then without a problem.
> I tried to re-order patches at some point though and I hope I didn't break
> something accidentally and forgot the state.
> 
> https://github.com/torvalds/linux/compare/master...qais-yousef:linux:uclamp-max-aggregation
> 

Thanks for sharing, that looks reasonable with consolidating it into uclamp_min.
Couple of thoughts on yours, I'm sure you're aware, but consider it me thinking out
loud:
- iowait boost is taken into consideration for task placement, but with just the
4 steps that made it more aggressive on HMP. (Potentially 2-3 consecutive iowait
wakeups to land on the big instead of running at max OPP of a LITTLE).
- If the current iowait boost decay is sensible is questionable, but there should
probably be some decay. Taken to the extreme this would mean something
like blk_wait_io() demands 1024 utilization, if it waits for a very long time.
Repeating myself here, but iowait wakeups itself is tricky to work with (and I
try to work around that).
- The intel_pstate solution will increase boost even if
previous_wakeup->iowait_boost > current->iowait_boost
right? But using current->iowait_boost is a clever idea.

[1]
https://lore.kernel.org/lkml/ZgKFT5b423hfQdl9@gmail.com/T/

Kind Regards,
Christian