Message-ID: <9a0f2192-b897-4952-b4ea-7fe229f33001@kernel.dk>
Date: Wed, 21 Aug 2024 09:04:12 -0600
From: Jens Axboe <axboe@...nel.dk>
To: Christian Loehle <christian.loehle@....com>, linux-kernel@...r.kernel.org
Cc: peterz@...radead.org, tglx@...utronix.de,
Vincent Guittot <vincent.guittot@...aro.org>,
Qais Yousef <qyousef@...alina.io>, "Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [PATCHSET v6 0/4] Split iowait into two states
On 8/21/24 8:54 AM, Christian Loehle wrote:
> On 8/19/24 16:39, Jens Axboe wrote:
>> Hi,
>>
>> This is v6 of the patchset where the current in_iowait state is split
>> into two parts:
>>
>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>> in terms of sleep and wakeup latencies.
>> 2) The above, and also accounted as such in the iowait stats.
>>
>> The current ->in_iowait covers both; this series splits it into two types
>> of state so that each can be controlled separately.
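
A minimal sketch of what such a split could look like is below. This is
not the actual patchset; the type, field, and helper names are made up
purely for illustration.

#include <stdbool.h>

/*
 * Illustrative model only: keep the "short IO sleep, wake me quickly"
 * hint toward cpufreq/cpuidle separate from the "count me as iowait in
 * the statistics" part, so callers can request one without the other.
 */
struct io_sleep_state {
	bool boost_hint;	/* latency hint for cpufreq/cpuidle */
	bool iowait_acct;	/* also visible in the iowait stats */
};

static void begin_io_sleep(struct io_sleep_state *s, bool account)
{
	s->boost_hint = true;		/* every IO sleep keeps the hint */
	s->iowait_acct = account;	/* accounting becomes opt-in */
}

static void end_io_sleep(struct io_sleep_state *s)
{
	s->boost_hint = false;
	s->iowait_acct = false;
}
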
>
> Hi Jens,
> I wanted to give a brief update on where I think we're at in terms
> of iowait behavior regarding cpuidle and cpufreq.
> I'm still working on getting both removed; given the discussions on
> the list [0] and at OSPM [1], this seems realistic and the best way
> forward IMO.
> That would then naturally make this series and the iowait workaround in
> io_uring/io_uring.c unnecessary.
>
> 1. For cpuidle:
> The main issue with relying on nr_iowaiters is that there is no guarantee
> whatsoever that these tasks will wake up where they went to sleep, so if
> we can achieve the same throughput without nr_iowaiters, it shouldn't be
> relevant (how menu uses nr_iowaiters is sketched below, after the numbers).
> I spent quite some time on fixing teo [2], because untangling nr_iowaiters
> from menu seems hard; essentially nobody has worked on menu seriously for
> a while now. Thus the plan here is to replace menu with teo eventually.
> For your io_uring workloads I see throughput on par for teo (which doesn't
> rely on iowait) and menu.
>
> # echo teo > /sys/devices/system/cpu/cpuidle/current_governor
> # ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
> submitter=0, tid=206, file=/dev/nvme0n1, node=-1
> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
> Engine=preadv2
> IOPS=22500, BW=87MiB/s, IOS/call=0/0
> IOPS=21916, BW=85MiB/s, IOS/call=1/0
> IOPS=21774, BW=85MiB/s, IOS/call=1/0
> IOPS=22467, BW=87MiB/s, IOS/call=1/0
> Exiting on timeout
> Maximum IOPS=22500
> # echo menu > /sys/devices/system/cpu/cpuidle/current_governor
> [ 178.754571] cpuidle: using governor menu
> # ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
> submitter=0, tid=209, file=/dev/nvme0n1, node=-1
> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
> Engine=preadv2
> IOPS=21452, BW=83MiB/s, IOS/call=0/0
> IOPS=21778, BW=85MiB/s, IOS/call=1/0
> IOPS=21120, BW=82MiB/s, IOS/call=1/0
> IOPS=20903, BW=81MiB/s, IOS/call=1/0
> Exiting on timeout
> Maximum IOPS=21778
>
> Please do give it a try for yourself as well!
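
The nr_iowaiters dependence mentioned above looks roughly like the
following in the menu governor. This is a paraphrased sketch, not
verbatim kernel code, but it shows the mechanism: tasks sleeping in
iowait on a CPU bias that CPU toward shallower idle states.

#define BUCKETS	12	/* correction-factor buckets in this sketch */

/*
 * Sleep-length predictions are corrected per bucket; any iowaiters on
 * the CPU push the prediction into a separate, performance-biased half
 * of the buckets.
 */
static unsigned int which_bucket(unsigned long long duration_ns,
				 unsigned int nr_iowaiters)
{
	unsigned int bucket = nr_iowaiters ? BUCKETS / 2 : 0;

	/* ... the expected sleep length selects a bucket within that half ... */
	return bucket;
}

/*
 * The exit latency the governor will tolerate is divided by this
 * multiplier, so iowaiters steer the selection toward shallow states.
 */
static unsigned int performance_multiplier(unsigned int nr_iowaiters)
{
	return 1 + 10 * nr_iowaiters;
}
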
>
> 2. For cpufreq:
> The main issue with iowait boosting for IO-bound workloads is that we're
> punishing the 'good' workloads (those without iowait sleeps in their
> throughput-critical path; having them there is already bad because of the
> scheduling overhead it induces) by making them energy-inefficient, just
> to make synthetic benchmarks happy (the boost mechanism is sketched below).
> A study of more realistic workloads shows that they don't suffer from a
> problem of building up utilization (not util_est anyway), so they don't
> actually benefit from a cpufreq boost.
> This leads me to the conclusion that cpufreq iowait boosting can be scrapped
> altogether if we accept some degradation of benchmarks like
> ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
> or
> fio --name=fio --rw=randread --bs=4k --runtime=5 --time_based --filename=/dev/nvme0n1 --iodepth=1 --numjobs=1
> (non-io_uring) for that matter.
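
For context, the boosting being discussed is schedutil's iowait boost.
Roughly (paraphrased sketch, not verbatim kernel code): every wakeup
from iowait doubles a per-CPU boost value up to full capacity, and the
boost halves again on updates without an iowait wakeup, so a tight
iowait sleep/wake loop quickly pins the CPU at a high frequency
regardless of its actual utilization.

#define CAPACITY_SCALE	1024U		/* full CPU capacity in this sketch */
#define BOOST_MIN	(CAPACITY_SCALE / 8)

static unsigned int iowait_boost;	/* per-CPU in the real implementation */

static void iowait_boost_update(int woke_from_iowait)
{
	if (woke_from_iowait) {
		/* each iowait wakeup doubles the boost, capped at capacity */
		iowait_boost = iowait_boost ? 2 * iowait_boost : BOOST_MIN;
		if (iowait_boost > CAPACITY_SCALE)
			iowait_boost = CAPACITY_SCALE;
	} else if (iowait_boost) {
		/* no iowait wakeup since the last update: decay the boost */
		iowait_boost /= 2;
		if (iowait_boost < BOOST_MIN)
			iowait_boost = 0;
	}
}

/* frequency selection then uses max(utilization, boost) */
static unsigned int boosted_util(unsigned int util)
{
	return util > iowait_boost ? util : iowait_boost;
}
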

The original iowait addition came about because a big regression was seen
compared to not setting iowait; it was around 20% iirc. That's big, and
not in the realm of "some degradation" that would be acceptable. And how
big the hit is will largely depend on the system being used. On some
systems it'll be less, and on some it'll be more.
> For io_uring, where the expected case is probably not single-threaded
> sync IO (or iodepth=1), the cpufreq iowait boost just hurts those
> use-cases by pushing them to less efficient frequencies that might not
> be needed.

People do all sorts of things, and sync (or low queue depth) IO is
certainly one of the use cases. In fact, that's where the above report
came from, on the postgres aio side.
> I know you want your problem (io_uring showing up as 100% busy even
> though it's just sleeping) to be solved like yesterday, and my opinion
> on a future timeline might not be enough to convince you of much. I
> wanted to share it anyway. I don't see an issue with the actual code
> you're proposing, but it does feel like a step in the wrong direction
> to me.

As mentioned in my original reply, I view this as entirely orthogonal,
and while I appreciate your efforts in this area, I'm a little tired of
this being brought up as a gatekeeping metric when that work isn't there
yet. If we can eliminate iowait for boosting down the line, then I'm all
for it. But this has now been pending for > 6 months, and I don't think
it's fair to keep stringing this along on a future promise. This isn't a
lot of code and it solves the issue for now; if the code gets removed
down the line as not needed, then that's certainly fine. For now, we
need it.
--
Jens Axboe