Message-ID: <5dd66e38-2b00-4b89-8b8c-cc25ad39dcb8@arm.com>
Date: Wed, 21 Aug 2024 16:57:09 +0100
From: Christian Loehle <christian.loehle@....com>
To: Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org
Cc: peterz@...radead.org, tglx@...utronix.de,
Vincent Guittot <vincent.guittot@...aro.org>,
Qais Yousef <qyousef@...alina.io>, "Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [PATCHSET v6 0/4] Split iowait into two states
On 8/21/24 16:04, Jens Axboe wrote:
> On 8/21/24 8:54 AM, Christian Loehle wrote:
>> On 8/19/24 16:39, Jens Axboe wrote:
>>> Hi,
>>>
>>> This is v6 of the patchset where the current in_iowait state is split
>>> into two parts:
>>>
>>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>>> in terms of sleep and wakeup latencies.
>>> 2) The above, and also accounted as such in the iowait stats.
>>>
>>> The current ->in_iowait covers both, this series splits it into two types
>>> of state so that each can be controlled separately.
>>
>> Hi Jens,
>> I wanted to give a brief update on where I think we are in terms
>> of iowait behavior regarding cpuidle and cpufreq.
>> I'm still working on getting iowait removed from both; given the
>> discussions on the list [0] and at OSPM [1], this seems realistic and
>> the best way forward IMO.
>> That would then naturally make this series and the iowait workaround in
>> io_uring/io_uring.c unnecessary.
>>
>> 1. For cpuidle:
>> The main issue with relying on nr_iowaiters is that there is no
>> guarantee whatsoever that these tasks will wake up where they went to
>> sleep, so if we can achieve the same throughput without nr_iowaiters it
>> shouldn't be relevant (a toy sketch of what I mean follows after the
>> numbers below).
>> I spent quite some time fixing teo [2], because untangling nr_iowaiters
>> from menu seems hard; essentially nobody has worked on menu seriously
>> for a while now. Thus the plan here is to eventually replace menu with
>> teo.
>> For your io_uring workloads I see throughput on par for teo (which
>> doesn't rely on iowait) and menu.
>>
>> # echo teo > /sys/devices/system/cpu/cpuidle/current_governor
>> # ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
>> submitter=0, tid=206, file=/dev/nvme0n1, node=-1
>> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
>> Engine=preadv2
>> IOPS=22500, BW=87MiB/s, IOS/call=0/0
>> IOPS=21916, BW=85MiB/s, IOS/call=1/0
>> IOPS=21774, BW=85MiB/s, IOS/call=1/0
>> IOPS=22467, BW=87MiB/s, IOS/call=1/0
>> Exiting on timeout
>> Maximum IOPS=22500
>> # echo menu > /sys/devices/system/cpu/cpuidle/current_governor
>> [ 178.754571] cpuidle: using governor menu
>> # ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
>> submitter=0, tid=209, file=/dev/nvme0n1, node=-1
>> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
>> Engine=preadv2
>> IOPS=21452, BW=83MiB/s, IOS/call=0/0
>> IOPS=21778, BW=85MiB/s, IOS/call=1/0
>> IOPS=21120, BW=82MiB/s, IOS/call=1/0
>> IOPS=20903, BW=81MiB/s, IOS/call=1/0
>> Exiting on timeout
>> Maximum IOPS=21778
>>
>> Please do give it a try for yourself as well!
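>>
>> To make the nr_iowaiters point above concrete, here is a toy sketch of
>> the kind of heuristic I'd like to get rid of (illustration only, not
>> the actual menu.c code, and the 10x factor is made up): the latency
>> budget collapses as soon as anything sleeps in iowait on that CPU, so
>> deep idle states get skipped, even though nothing guarantees the wakeup
>> happens on that CPU.
>>
>> #include <stdio.h>
>>
>> /*
>>  * Toy model only, not menu.c: shrink the acceptable exit latency by
>>  * the number of tasks currently in iowait on this CPU, so deep idle
>>  * states become unattractive while anything sleeps in iowait there.
>>  */
>> static unsigned int biased_latency_us(unsigned int latency_req_us,
>>                                       unsigned int nr_iowaiters)
>> {
>>         return latency_req_us / (1 + 10 * nr_iowaiters);
>> }
>>
>> int main(void)
>> {
>>         for (unsigned int nr = 0; nr <= 4; nr++)
>>                 printf("nr_iowaiters=%u -> latency budget %u us\n",
>>                        nr, biased_latency_us(10000, nr));
>>         return 0;
>> }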
>>
>> 2. For cpufreq:
>> The main issue with iowait boosting for IO-bound workloads is that we're
>> punishing the 'good' workloads, i.e. those that don't have iowait sleeps
>> in their throughput-critical part (having them there is already bad
>> because of the scheduling overhead it induces), by making them
>> energy-inefficient just to keep synthetic benchmarks happy.
>> Looking at more realistic workloads shows that they don't have a problem
>> building up utilization (not util_est anyway), so they don't actually
>> benefit from a cpufreq boost.
>> This leads me to the conclusion that cpufreq iowait boosting can be
>> scrapped altogether if we accept some degradation in benchmarks like
>> ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
>> or, for that matter, the non-io_uring
>> fio --name=fio --rw=randread --bs=4k --runtime=5 --time_based --filename=/dev/nvme0n1 --iodepth=1 --numjobs=1
>
> The original iowait addition came because a big regression was seen
> compared to not setting iowait; it was around 20% iirc. That's big, and
> not in the realm of "some degradation" that would be acceptable. And that
> will largely depend on the system being used. On some systems it'll be
> less, and on some it'll be more.
FWIW we are also talking about power regressions of easily 1000% for e.g.
fio --name=fio --rw=randread --bs=4k --runtime=10 --time_based --filename=/dev/nvme0n1 --iodepth=32 --numjobs=nr_cpus --ioengine=io_uring
(without any throughput gain).
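To illustrate why the boost gets stuck there: the iowait boost roughly
doubles on consecutive iowait wakeups and only decays on updates that
don't see one, so with 32 inflight IOs per CPU it saturates after a few
wakeups and stays at max while the CPU is mostly idle. A toy model of
that doubling/halving shape (not the actual schedutil code, the numbers
are only illustrative):

#include <stdio.h>

/*
 * Toy model of an iowait-boost style heuristic, not the schedutil
 * implementation: double the boost on every wakeup from iowait,
 * halve it on updates without one.
 */
int main(void)
{
        unsigned int boost = 0;
        const unsigned int max = 1024, min = 128;

        for (int tick = 0; tick < 10; tick++) {
                int iowait_wakeup = 1;  /* high iodepth: every update sees one */

                if (iowait_wakeup)
                        boost = boost ? (2 * boost > max ? max : 2 * boost) : min;
                else
                        boost /= 2;     /* decay only without iowait wakeups */

                printf("tick %d: boost %u/%u\n", tick, boost, max);
        }
        return 0;       /* saturates at max after a few ticks and stays there */
}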
>
>> For io_uring, where the expected case is probably not single-threaded
>> sync IO (or iodepth=1), the cpufreq iowait boost just hurts these
>> use-cases by pushing the CPU to less efficient frequencies that might
>> not be needed.
>
> People do all sorts of things, and sync (or low queue depth) IO is
> certainly one of the use cases. In fact that's where the above report
> came from, on the postgres aio side.
I have looked at that, and (on the platforms I've tested) it was indeed
coming from cpuidle FWIW. Moving away from menu, with the mainlined teo
fixes, did remedy it.
>> I know you want your problem (io_uring showing up as 100% busy even
>> though it's just sleeping) to be solved like yesterday, and my opinion
>> on a future timeline might not be enough to convince you of much, but I
>> wanted to share it anyway. I don't see an issue with the actual code
>> you're proposing, but it does feel like a step in the wrong direction
>> to me.
>
> As mentioned in my original reply, I view this as entirely orthogonal,
> and while I appreciate your efforts in this area, I'm a little tired of
> this being brought up as a gatekeeping metric when it's not there.
I can understand you being tired of me bringing this up, but I'm not
gatekeeping this series, not intentionally anyway.
I'm just trying to give some perspective on where the iowait behavior
is headed overall.
>
> If we can eliminate iowait for boosting down the line, then I'm all for
> it. But this has now been pending for > 6 months and I don't think it's
> fair to keep stringing this along on a future promise. This isn't a lot
> of code and it solves the issue for now; if the code gets removed down
> the line as not needed, then that's certainly fine. For now, we need it.
I'm fine with carrying a revert of the series along with my patchset.
Regards,
Christian