Date: Thu, 25 Apr 2024 11:39:40 +0100
From: Christian Loehle <christian.loehle@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org,
 tglx@...utronix.de, "Rafael J. Wysocki" <rjw@...ysocki.net>,
 linux-pm@...r.kernel.org, daniel.lezcano@...aro.org
Subject: Re: [PATCH 4/4] sched/core: split iowait state into two states

On 25/04/2024 11:16, Peter Zijlstra wrote:
> On Wed, Apr 24, 2024 at 11:08:42AM +0100, Christian Loehle wrote:
>> On 24/04/2024 11:01, Peter Zijlstra wrote:
>>> On Tue, Apr 16, 2024 at 06:11:21AM -0600, Jens Axboe wrote:
>>>> iowait is a bogus metric, but it's helpful in the sense that it allows
>>>> short waits to not enter sleep states that have a higher exit latency
>>>> than would otherwise have been picked for iowait'ing tasks. However,
>>>> it's harmful in that lots of applications and monitoring assume that
>>>> iowait is busy time, or otherwise use it as a health metric.
>>>> Particularly for async IO it's entirely nonsensical.
>>>
>>> Let me get this straight, all of this is about working around
>>> cpuidle menu governor insanity?
>>>
>>> Rafael, how far along are we with fully deprecating that thing? Yes it
>>> still exists, but should people really be using it still?
>>>
>>
>> Well there is also the iowait boost handling in schedutil and intel_pstate
>> which, at least in synthetic benchmarks, does have an effect [1].
> 
> Those are cpufreq, not cpuidle, and at least they don't use nr_iowait. The
> original Changelog mentioned idle states, and I hate on menu for using
> nr_iowait.

I'd say they care about any regression, but I'll let Jens answer that.
The original change also mentions cpufreq, and Jens did mention in an
earlier version that he doesn't care; for them it's all just increased
latency ;)
https://lore.kernel.org/lkml/00d36e83-c9a5-412d-bf49-2e109308d6cd@arm.com/T/#m216536520bc31846aff5875993d22f446a37b297

> 
>> io_uring (the only user of iowait but not iowait_acct) works around both.
>>
>> See commit ("8a796565cec3 io_uring: Use io_schedule* in cqring wait")
>>
>> [1]
>> https://lore.kernel.org/lkml/20240304201625.100619-1-christian.loehle@arm.com/#t
> 
> So while I agree with most of the shortcomings listed in that set, that
> patch is quite terrifying.

Not disagreeing with you on that.
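
For reference, the io_schedule* helpers that commit switches to exist
precisely to mark a sleep as iowait. A minimal sketch of the prepare/finish
variant of that pattern (not the actual io_uring diff; the waitqueue and
wake-up plumbing is omitted):

#include <linux/sched.h>

static void cqring_wait_sketch(void)
{
	int token;

	token = io_schedule_prepare();	/* mark current as in_iowait */
	schedule();			/* the actual sleep; condition setup omitted */
	io_schedule_finish(token);	/* restore the previous in_iowait state */
}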

> 
> I would prefer to start with something a *lot* simpler. How about a
> tick-driven decay of an iops count per task? And that whole step array
> *shudder*.

It's an attempt at solving the problem of unnecessary boosting based on what
is there for us to work with right now: iowait wakeups.
There are many workloads with e.g. > 5000 iowait wakeups per second that don't
benefit from boosting at all (and therefore boosting them is a complete energy
waste). I don't see an obvious way to detect such non-boost-worthy scenarios
with a tick-driven decay count, but please do elaborate.
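
For illustration, a rough standalone model of what I understand the
suggestion to be: bump a per-task count on every iowait wakeup and halve it
on every tick, boosting while the decayed count is above some threshold.
All names, the decay factor and the threshold below are made up:

#include <stdio.h>

#define BOOST_THRESHOLD	8	/* boost while the decayed count exceeds this */

struct task_model {
	unsigned int iowait_cnt;	/* decayed count of iowait wakeups */
};

/* Called on every iowait wakeup of the task. */
static void iowait_wakeup(struct task_model *t)
{
	t->iowait_cnt++;
}

/* Called once per tick: decay the count by half. */
static void scheduler_tick_decay(struct task_model *t)
{
	t->iowait_cnt >>= 1;
}

static int wants_boost(const struct task_model *t)
{
	return t->iowait_cnt > BOOST_THRESHOLD;
}

int main(void)
{
	struct task_model t = { 0 };
	int tick, wakeups;

	/*
	 * A 250 Hz tick with 5000 iowait wakeups/s is ~20 wakeups per tick:
	 * the count saturates far above the threshold, so such a task gets
	 * boosted essentially all the time, whether or not it benefits.
	 */
	for (tick = 0; tick < 5; tick++) {
		for (wakeups = 0; wakeups < 20; wakeups++)
			iowait_wakeup(&t);
		printf("tick %d: cnt=%u boost=%d\n",
		       tick, t.iowait_cnt, wants_boost(&t));
		scheduler_tick_decay(&t);
	}
	return 0;
}

That's my worry: the count tells us how much iowait wakeup activity there is,
but not whether boosting actually helps the workload.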

(If you *really* care about IO throughput, the task wakeup path is hopefully
not critical anyway, i.e. you do everything in your power to have IO pending
during that time, and then we don't need boosting; but just looking at a
tick-length period doesn't let us distinguish those scenarios AFAICS.)

Regards,
Christian
