Message-ID: <df434f80-adb1-46b4-9502-f39b089ee4a3@kernel.dk>
Date: Tue, 27 Feb 2024 05:53:38 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Christian Loehle <christian.loehle@....com>,
LKML <linux-kernel@...r.kernel.org>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] sched/core: split iowait state into two states
On 2/27/24 3:50 AM, Christian Loehle wrote:
> Hi Jens,
>
> On 26/02/2024 16:15, Jens Axboe wrote:
>> iowait is a bogus metric, but it's helpful in the sense that it allows
>> short waits to not enter sleep states that have a higher exit latency
>> than we would've picked for iowait'ing tasks. However, it's harmful in
>> that lots of applications and monitoring assume that iowait is busy
>> time, or otherwise use it as a health metric. Particularly for async
>> IO it's entirely nonsensical.
>>
>> Split the iowait part into two parts - one that tracks whether we need
>> boosting for short waits, and one that says we need to account the task
>> as such. ->in_iowait_acct nests inside of ->in_iowait, both for
>> efficiency reasons, but also so that the relationship between the two
>> is clear. A waiter may set ->in_iowait alone and not care about the
>> accounting.
>>
>> Existing users of nr_iowait() for accounting purposes are switched to
>> use nr_iowait_acct(), which leaves the governor using nr_iowait() as it
>> only cares about iowaiters, not the accounting side.
>>
>> io_schedule_prepare() and io_schedule_finish() are changed to return
>> a simple mask of two state bits, as we now have more than one state to
>> manage. Outside of that, no further changes are needed to support this
>> generically.
>> [snip]
>
> Actually there are probably three uses of the in_iowait flag
> 1. The (original) accounting use
> 2. The sleep state heuristic based on nr_iowaiters in cpuidle/governors/menu.c
> 3. The CPU frequency boost when in_iowait tasks wake up implemented in both
> intel_pstate.c and cpufreq_schedutil.c cpufreq governors.
>
> 2 & 3 have just been piggybacked onto 1 because they work somewhat, but as
> your patch also shows they really don't.
Right, I did collapse 2 & 3 into cpufreq related sleep/wakeup latencies.
> I have been working on a hopefully better approach for 3., I'll use
> your patch as a chance to reintroduce the problem. I was going to ask
> for your thoughts on the patch anyway.
>
> The piggybacking of 2 and 3 has (IMHO) more dire consequences than
> just the fact that you have to accept being accounted for as busy
> (until now) if you wanted to make use of 2 and 3.
>
> I assume the intention of your patch is to remove this link for the
> io_uring case in particular, given that AFAICT it's the only occurrence
> actually affected by your patch (sets in_iowait directly and not the
> helper functions which will set both in_iowait and in_iowait_acct).
Right. It doesn't matter too much for storage as people kind of expect
iowait on that side, but for high frequency network IO (or just
networked IO in general), adding iowait to the mix tends to confuse
application owners. And since that stat is mostly garbage anyway, I can
either spend time arguing with people that it's a useless metric, or I
can do something about it and just eliminate it on my side for good.
BTW, reading your email you seem to equate io_uring with storage, but
this is very much not the case. Just wanted to clarify that this is in no
way storage specific.
> I think that is the right direction, but if we touch this stuff, can
> we also consider reworking it entirely? Let's take io_uring as an
> example, not because it's the worst, but because its overhead is so
> low it shows the biggest problem (or room for improvement).
Sure, I have no objections to that, though I do want to fix the
immediate problem of just getting rid of iowait accounting. As I don't
think the next step is immediately obvious, I'd prefer if we can at
least fix the immediate issue and defer a rework to a step 2.
> The iowait boosting of the CPU frequency will currently lead to e.g.
> io_uring NR_CPUS threads with high enough iodepth (let's say 128) to
> possibly run all CPUs on the highest, or at least one of the higher
> OPPs (frequency and therefore power consumption). (fio --rw=randread
> --bs=4k --ioengine=io_uring --iodepth=128 --numjobs=$(nproc) for 12
> CPUs) if we're using e.g. cpufreq_schedutil.c on all of them. This is
> an issue as on many systems even running them on the lowest OPP
> suffices to saturate the storage device (using cpufreq governor
> powersave on all). The frequency boost based on iowait is therefore
> incredibly wasteful here and destroys the incredibly low overhead of
> io_uring and the impact it could have on energy being spent by the
> CPU.
To be honest, I think it's hard to generalize on that. For the above
example, it completely depends on what you're driving. If this is 12
CPUs doing IO to N devices, what kind of devices are these? Are they
doing millions of IOPS each, or is it 100k each? A storage device is
many things, and you can easily have a storage device that it would take
more than one CPU to fully saturate. Or you can have 12 of them that one
can easily saturate.
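To put the "it depends on the device" point in arithmetic terms, here is
a small sketch. The per-CPU and per-device IOPS rates are inputs the
caller has to measure; nothing here comes from real hardware, and the
function name is hypothetical:

```c
/*
 * Rough estimate of how many CPUs it takes to saturate a set of
 * identical devices. All rates are caller-supplied assumptions.
 * Returns 0 if per_cpu_iops is 0, as no estimate is possible then.
 */
static unsigned long long cpus_to_saturate(unsigned long long nr_devices,
					   unsigned long long iops_per_device,
					   unsigned long long per_cpu_iops)
{
	unsigned long long total;

	if (!per_cpu_iops)
		return 0;

	total = nr_devices * iops_per_device;
	/* round up: a partially used CPU is still a CPU */
	return (total + per_cpu_iops - 1) / per_cpu_iops;
}
```

With made-up inputs of 12 devices at 100k IOPS each and a CPU that can
drive 1M IOPS, this gives 2 CPUs. With 12 devices at 2M IOPS each and the
same CPU, it gives 24, which is more than the 12 CPUs in the example.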
> Looking at git grep io_schedule and mutex_lock_io iowait currently
> means anything from actual block io over sending CXL transactions to
> waiting for DMA fences in the i915 GPU driver. These things are clearly
> very different and deserve distinct handling.
>
> Even if we remain in the realm of block io we have, as you already put
> it nicely, "for async IO it's entirely nonsensical", but it doesn't
> stop there. Writes in general have a similar problem: for some SSDs we
> just boost the CPU frequency to land a tiny bit earlier in the SSD's
> DRAM cache, where it will be flushed to flash at its convenience (or
> necessity). Again, boosting being entirely wasteful. Boosting is of
> course also applied on periodic page cache writebacks for usually no
> good reason at all.
I don't think that is true at all. We don't boost to have data land in
the drive cache earlier, we boost so that:
1) prepare IO to device
2) submit IO to device
3) wait on IO completion, task goes to sleep
4) IO completes, wake task
5) task wakes up, gets completion
the last two steps here aren't burdened by latencies that are higher
than they need to be, IOW uses 2 & 3 from your list above. This is why I'm
bundling them into one, as they really are the same thing from that
perspective.
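As a rough sketch of why pending IO waiters matter for steps 4 and 5,
here is a toy idle state picker. It is only loosely modelled on the
multiplier idea in the cpuidle menu governor. The factor of 10, the state
table layout and all names are assumptions for illustration, not the
kernel's actual code:

```c
#include <stddef.h>

/* Hypothetical idle state description, ordered shallow to deep. */
struct idle_state_model {
	const char *name;
	unsigned int exit_latency_us;
};

/* Assumed scaling: more iowaiters means we tolerate less exit latency. */
static unsigned int model_perf_multiplier(unsigned int nr_iowaiters)
{
	return 1 + 10 * nr_iowaiters;
}

/*
 * Pick the deepest state whose exit latency fits in the latency budget.
 * The budget shrinks as the number of iowaiters grows, so a CPU with IO
 * in flight stays in a shallower state and wakes the task up sooner.
 * Index 0 is always allowed as the fallback.
 */
static size_t model_pick_state(const struct idle_state_model *states,
			       size_t nr_states,
			       unsigned int predicted_sleep_us,
			       unsigned int nr_iowaiters)
{
	unsigned int budget_us =
		predicted_sleep_us / model_perf_multiplier(nr_iowaiters);
	size_t i, pick = 0;

	for (i = 1; i < nr_states; i++) {
		if (states[i].exit_latency_us > budget_us)
			break;
		pick = i;
	}
	return pick;
}
```

The point of the sketch is only that the same predicted sleep length
leads to a shallower state once there are waiters, which is what keeps
the wakeup in step 4 and the completion in step 5 cheap.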
> I have a patch for 3 that (among other changes) tracks if the boost
> actually improved throughput (measured in the only way we currently
> can, iowait wakeups per time interval). I think it's an improvement
> over the current situation, but it's far from perfect.
>
> Ideally we would get the three different signals as distinct:
> 1. iowait_acct
> 2. iowait_short_sleep or something, we expect to wake up pretty soon
> due to some IO, (which in case of the block layer maybe there would
> even be some estimate when?)
> 3. iowait_util_boost to signal we are in a scenario where the time
> between iowaits (that the task is potentially using the CPU), is
> critical to IO throughput and therefore running it as quickly as
> possible is worth the energy spending of boosting.
>
>
> Ideally we (the sched folks) would move away from these
> iowait-piggybacked heuristics and try to get as much information as
> possible from e.g. the block layer and act accordingly. At least
> for the iowait boosting of frequency I would claim the heuristics are
> wrong more often than not.
>
> Would love to hear your thoughts and thanks for the patch (and
> apologies for this scope-explosion, but I think the discussion is
> worth having).
As mentioned higher up, I do agree that there's room for improvement for
the heuristics in general, and I'll be more than happy to help test and
help with the block layer or io_uring side of things too. If we can get
good latencies when we need it and be cognizant of power at the same
time, that's certainly a win all around.
However, I would greatly prefer to sort out the mixing up of iowait
accounting and boosting first as it's a much simpler problem and
deserves fixing separately, and is not one that will inevitably get
complicated as it needs to coordinate across layers. Outside of that,
any change in heuristics for that will need considerable testing,
whereas the existing one is not encumbered by that.
I'll respin this version to try and avoid the atomics here, as that was
a comment that Peter had. If we can improve the existing nr_iowait
accounting and logic with that as well, then I think that's an exercise
that's worthwhile separately.
--
Jens Axboe