Message-ID: <c8cd6339-c168-4409-8cc4-e85e7ad92914@arm.com>
Date: Wed, 21 Aug 2024 15:54:31 +0100
From: Christian Loehle <christian.loehle@....com>
To: Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org
Cc: peterz@...radead.org, tglx@...utronix.de,
Vincent Guittot <vincent.guittot@...aro.org>,
Qais Yousef <qyousef@...alina.io>, "Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [PATCHSET v6 0/4] Split iowait into two states
On 8/19/24 16:39, Jens Axboe wrote:
> Hi,
>
> This is v6 of the patchset where the current in_iowait state is split
> into two parts:
>
> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
> in terms of sleep and wakeup latencies.
> 2) The above, and also accounted as such in the iowait stats.
>
> The current ->in_iowait covers both, this series splits it into two types
> of state so that each can be controlled separately.
Hi Jens,
I wanted to give a brief update on where I think we are in terms
of iowait behavior regarding cpuidle and cpufreq.
I'm still working on getting both removed; given the discussions on
the list [0] and at OSPM [1], this seems realistic and the best way
forward IMO.
That would then naturally make this series and the iowait workaround in
io_uring/io_uring.c unnecessary.
1. For cpuidle:
The main issue with relying on nr_iowaiters is that there is no
guarantee whatsoever that these tasks will wake up where they went to
sleep, so if we can achieve the same throughput without nr_iowaiters
it shouldn't be relevant.
I spent quite some time fixing teo [2], because untangling nr_iowaiters
from menu seems hard; essentially nobody has worked on menu seriously
for a while now. The plan here is thus to eventually replace menu with teo.
For your io_uring workloads I see throughput on par between teo (which
doesn't rely on iowait) and menu.
# echo teo > /sys/devices/system/cpu/cpuidle/current_governor
# ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
submitter=0, tid=206, file=/dev/nvme0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
Engine=preadv2
IOPS=22500, BW=87MiB/s, IOS/call=0/0
IOPS=21916, BW=85MiB/s, IOS/call=1/0
IOPS=21774, BW=85MiB/s, IOS/call=1/0
IOPS=22467, BW=87MiB/s, IOS/call=1/0
Exiting on timeout
Maximum IOPS=22500
# echo menu > /sys/devices/system/cpu/cpuidle/current_governor
[ 178.754571] cpuidle: using governor menu
# ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
submitter=0, tid=209, file=/dev/nvme0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
Engine=preadv2
IOPS=21452, BW=83MiB/s, IOS/call=0/0
IOPS=21778, BW=85MiB/s, IOS/call=1/0
IOPS=21120, BW=82MiB/s, IOS/call=1/0
IOPS=20903, BW=81MiB/s, IOS/call=1/0
Exiting on timeout
Maximum IOPS=21778
Please do give it a try for yourself as well!
2. For cpufreq:
The main issue with iowait boosting for IO-bound workloads is that we
punish the 'good' workloads (those that don't have iowait sleeps in
their throughput-critical part, having them there being already bad
because of the induced scheduling overhead) by making them
energy-inefficient, just to make synthetic benchmarks happy.
A study of more realistic workloads shows that they don't suffer from a
problem of building up utilization (not with util_est, anyway), so they
don't actually benefit from a cpufreq boost.
This leads me to the conclusion that cpufreq iowait boosting can be scrapped
altogether if we accept some degradation of benchmarks like
./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
or
fio --name=fio --rw=randread --bs=4k --runtime=5 --time_based --filename=/dev/nvme0n1 --iodepth=1 --numjobs=1
(non-io_uring) for that matter.
For io_uring, where the expected case is probably not single-threaded
sync IO (or iodepth=1), the cpufreq iowait boost just hurts use-cases
by pushing them to less efficient frequencies that might not be needed.
I know you want your problem (io_uring showing up as 100% busy even
though it's just sleeping) solved like yesterday, and my opinion on a
future timeline might not be enough to convince you of much, but I
wanted to share it anyway.
I don't see an issue with the actual code you're proposing, but it does feel
like a step in the wrong direction to me.
[0] https://lore.kernel.org/lkml/20240304201625.100619-1-christian.loehle@arm.com/
v2: https://lore.kernel.org/lkml/20240518113947.2127802-1-christian.loehle@arm.com/
[1] https://www.youtube.com/watch?v=MSQGEsSziZ4
[2] https://lore.kernel.org/lkml/20240628095955.34096-1-christian.loehle@arm.com/
Regards,
Christian
> [snip]