Message-ID: <0524d3fa-c70e-45a5-96e1-317431f6ce8d@kernel.dk>
Date: Sat, 24 Aug 2024 09:34:05 -0600
From: Jens Axboe <axboe@...nel.dk>
To: Christian Loehle <christian.loehle@....com>, linux-kernel@...r.kernel.org
Cc: peterz@...radead.org, tglx@...utronix.de,
Vincent Guittot <vincent.guittot@...aro.org>,
Qais Yousef <qyousef@...alina.io>, "Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [PATCHSET v6 0/4] Split iowait into two states
On 8/21/24 9:57 AM, Christian Loehle wrote:
> On 8/21/24 16:04, Jens Axboe wrote:
>> On 8/21/24 8:54 AM, Christian Loehle wrote:
>>> On 8/19/24 16:39, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> This is v6 of the patchset where the current in_iowait state is split
>>>> into two parts:
>>>>
>>>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>>>> in terms of sleep and wakeup latencies.
>>>> 2) The above, and also accounted as such in the iowait stats.
>>>>
>>>> The current ->in_iowait covers both; this series splits it into two types
>>>> of state so that each can be controlled separately.
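For context, the split roughly amounts to the sketch below. This is
only an illustration of the intent, not the actual patches, and the
name of the accounting-side flag is an assumption here:

/*
 * Illustration only: one bit for "treat this sleep as iowait for
 * cpufreq/cpuidle purposes", one bit for "also account it in the
 * iowait statistics". The _acct name is assumed.
 */
struct task_iowait_state {
	unsigned in_iowait:1;		/* case 1 above */
	unsigned in_iowait_acct:1;	/* case 2 above */
};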
>>>
>>> Hi Jens,
>>> I wanted to give a brief update on where I think we're at in terms
>>> of iowait behavior regarding cpuidle and cpufreq.
>>> I'm still working on getting both removed; given the discussions on
>>> the list [0] and at OSPM [1], this seems realistic and the best way
>>> forward IMO.
>>> That would then naturally make this series and the iowait workaround in
>>> io_uring/io_uring.c unnecessary.
>>>
>>> 1. For cpuidle:
>>> The main issue with relying on nr_iowaiters is that there is no guarantee
>>> whatsoever that these tasks will wake up where they went to sleep, so if
>>> we can achieve the same throughput without nr_iowaiters it shouldn't
>>> be relevant.
>>> I spent quite some time fixing teo [2], because untangling nr_iowaiters
>>> from menu seems hard; essentially nobody has worked on menu seriously for
>>> a while now. Thus the plan here is to eventually replace menu with teo.
>>> For your io_uring workloads I see throughput on par for teo (which doesn't
>>> rely on iowait) and menu.
>>>
>>> # echo teo > /sys/devices/system/cpu/cpuidle/current_governor
>>> # ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
>>> submitter=0, tid=206, file=/dev/nvme0n1, node=-1
>>> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
>>> Engine=preadv2
>>> IOPS=22500, BW=87MiB/s, IOS/call=0/0
>>> IOPS=21916, BW=85MiB/s, IOS/call=1/0
>>> IOPS=21774, BW=85MiB/s, IOS/call=1/0
>>> IOPS=22467, BW=87MiB/s, IOS/call=1/0
>>> Exiting on timeout
>>> Maximum IOPS=22500
>>> # echo menu > /sys/devices/system/cpu/cpuidle/current_governor
>>> [ 178.754571] cpuidle: using governor menu
>>> # ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
>>> submitter=0, tid=209, file=/dev/nvme0n1, node=-1
>>> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
>>> Engine=preadv2
>>> IOPS=21452, BW=83MiB/s, IOS/call=0/0
>>> IOPS=21778, BW=85MiB/s, IOS/call=1/0
>>> IOPS=21120, BW=82MiB/s, IOS/call=1/0
>>> IOPS=20903, BW=81MiB/s, IOS/call=1/0
>>> Exiting on timeout
>>> Maximum IOPS=21778
>>>
>>> Please do give it a try for yourself as well!
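For anyone following along, the nr_iowaiters heuristic being untangled
above is roughly of the following shape. This is a simplified,
standalone illustration of the idea, not the kernel's menu governor
code:

#include <stddef.h>

struct idle_state {
	unsigned int exit_latency_us;
	unsigned int target_residency_us;
};

/*
 * Illustration only: bias idle-state selection toward shallower states
 * when tasks on this CPU went to sleep in iowait, on the assumption
 * that they will wake up here soon. The objection is that nothing
 * guarantees they wake up on this CPU at all.
 */
static size_t pick_idle_state(const struct idle_state *states, size_t nr,
			      unsigned int predicted_sleep_us,
			      unsigned int nr_iowaiters)
{
	size_t chosen = 0;

	/* The iowait bias: assume the sleep will be shorter than predicted. */
	if (nr_iowaiters)
		predicted_sleep_us /= 1 + nr_iowaiters;

	/* Pick the deepest state whose cost still fits the (biased) prediction. */
	for (size_t i = 1; i < nr; i++) {
		if (states[i].target_residency_us <= predicted_sleep_us &&
		    states[i].exit_latency_us <= predicted_sleep_us)
			chosen = i;
	}
	return chosen;
}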
>>>
>>> 2. For cpufreq:
>>> The main issue with iowait boosting for IO-bound workloads is that we're
>>> punishing the 'good' workloads (those that don't have iowait sleeps in their
>>> throughput-critical part, which is already bad because of the scheduling
>>> overhead induced) by making them energy-inefficient just to make synthetic
>>> benchmarks happy.
>>> A study of more realistic workloads shows that they don't have a problem
>>> building up utilization (not util_est anyway), so they don't actually benefit
>>> from a cpufreq boost.
>>> This leads me to the conclusion that cpufreq iowait boosting can be scrapped
>>> altogether if we accept some degradation of benchmarks like
>>> ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
>>> or
>>> fio --name=fio --rw=randread --bs=4k --runtime=5 --time_based --filename=/dev/nvme0n1 --iodepth=1 --numjobs=1
>>> (non-io_uring) for that matter.
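The cpufreq boost under discussion works roughly like this: wakeups
from iowait sleeps ramp a boost value that is folded into the frequency
request, and the boost decays once the iowait wakeups stop. A
simplified, standalone illustration (not the actual schedutil code):

/* Illustration only: iowait frequency boost ramp/decay. */
#define BOOST_MIN	128
#define BOOST_MAX	1024	/* full scale */

struct boost_state {
	unsigned int iowait_boost;	/* 0 means the boost is inactive */
};

static void boost_update_on_wakeup(struct boost_state *bs, int woke_from_iowait)
{
	if (woke_from_iowait) {
		/* Ramp: double the boost on each iowait wakeup, up to full scale. */
		unsigned int next = bs->iowait_boost ? bs->iowait_boost * 2 : BOOST_MIN;

		bs->iowait_boost = next > BOOST_MAX ? BOOST_MAX : next;
	} else if (bs->iowait_boost) {
		/* Decay once the wakeups stop coming from iowait sleeps. */
		bs->iowait_boost /= 2;
		if (bs->iowait_boost < BOOST_MIN)
			bs->iowait_boost = 0;
	}
}

/* The boost acts as a floor on the utilization that picks the frequency. */
static unsigned int boosted_util(const struct boost_state *bs, unsigned int util)
{
	return util > bs->iowait_boost ? util : bs->iowait_boost;
}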
>>
>> The original iowait addition came because a big regression was seen
>> compared to not setting iowait; it was around 20% iirc. That's big, and
>> not in the realm of "some degradation" that would be acceptable. And that
>> will largely depend on the system being used. On some systems, it'll be
>> less, and on some it'll be more.
>
> We are also talking about power regressions of 1000% easily FWIW for
> e.g. fio --name=fio --rw=randread --bs=4k --runtime=10 --time_based
> --filename=/dev/nvme0n1 --iodepth=32 --numjobs=nr_cpus
> --ioengine=io_uring (without any throughput gain).
Oh I believe it, for some embedded or low power CPUs. And it is on our
list to make this selectable. Ideally what I think should happen is
that the application gives you a hint on how long it expects to sleep,
and we'll pass that on and let the lower layers decide what's the most
appropriate state to enter. The current iowait usage isn't very pretty
(in io_uring or otherwise, it's too coarse of a hint), but it's what we
have/had, and we needed it to solve a problem that would otherwise be a
regression on a much more common setup than really low power devices.
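To make that idea a bit more concrete, the sketch below shows the kind
of interface I mean. It is purely hypothetical: nothing like it exists
in io_uring or the scheduler today, and every name in it is made up.

#include <stdint.h>

/* Hypothetical sketch only, no such interface exists today. */
struct sleep_hint {
	uint64_t expected_sleep_ns;	/* 0 means "no idea", fall back to heuristics */
};

/* What a lower layer might do with such a hint (illustrative only). */
static int deep_idle_worth_it(const struct sleep_hint *hint,
			      uint64_t state_exit_latency_ns)
{
	if (!hint->expected_sleep_ns)
		return 1;	/* no hint, keep the existing behavior */
	/* Only pay for a deep state if the expected sleep covers its cost. */
	return hint->expected_sleep_ns > 2 * state_exit_latency_ns;
}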
>>> For io_uring, where the expected case is probably not single-threaded
>>> sync IO (or iodepth=1), the cpufreq iowait boost is just hurting
>>> use-cases by pushing them to less efficient frequencies that might not
>>> be needed.
>>
>> People do all sorts of things, and sync (or low queue depth) IO is
>> certainly one of the use cases. In fact that's where the above report
>> came from, on the postgres aio side.
>
> I have looked at that and (on the platforms I've tested) that was indeed
> from cpuidle FWIW. Moving away from menu did remedy this with the
> mainlined teo fixes.
>
>>> I know you want your problem (io_uring showing up as 100% busy even
>>> though it's just sleeping) to be solved like yesterday, and my opinion
>>> on a future timeline might not be enough to convince you of much. I
>>> wanted to share it anyway. I don't see an issue with the actual code
>>> you're proposing, but it does feel like a step in the wrong direction
>>> to me.
>>
>> As mentioned in my original reply, I view this as entirely orthogonal,
>> and while I appreciate your efforts in this area, I'm a little tired of
>> this being brought up as a gatekeeping metric when it's not there yet.
>
> I can understand you being tired of me bringing this up, but I'm not
> gatekeeping this series, not intentionally anyway.
Well it does feel like that, because this orthogonal (imho) development
is being brought up as a reason why this series isn't needed. Not just in
this posting, but in past ones too. Meanwhile, I'd like this problem solved,
and this just adds noise to it as far as I'm concerned. It would be a lot
better to split those two discussions up.
>> If we can eliminate iowait for boosting down the line, then I'm all for
>> it. But this has now been pending for > 6 months and I don't think it's
>> fair to keep stringing this along on a future promise. This isn't a lot
>> of code and it solves the issue for now, if the code will get removed
>> down the line as not needed, then that's certainly fine. For now, we
>> need it.
>
> I'm fine with carrying a revert of the series along my patchset.
OK that's fine, and let's hope we end up in a place down the line that's
a lot better than the iowait on/off we have now, with guesswork on the
other end, based on past behavior (iow, mostly wrong), about how long we
expect to sleep. I'd certainly be all for that, I just don't want future
promises to hold up fixing a real issue we have now. If this series goes
away down the line because we don't need it, I surely won't cry over it!
--
Jens Axboe