Message-ID: <CAHk-=whA2ztAcVrgsqj39j30LJYhjBSkk6Dju6TY16zGpXpkZQ@mail.gmail.com>
Date: Thu, 18 May 2023 17:41:29 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Tejun Heo <tj@...nel.org>
Cc: jiangshanlai@...il.com, peterz@...radead.org,
linux-kernel@...r.kernel.org, kernel-team@...a.com,
joshdon@...gle.com, brho@...gle.com, briannorris@...omium.org,
nhuck@...gle.com, agk@...hat.com, snitzer@...nel.org,
void@...ifault.com
Subject: Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue
execution locality

On Thu, May 18, 2023 at 5:17 PM Tejun Heo <tj@...nel.org> wrote:
>
> Most of the patchset is workqueue internal plumbing and probably isn't
> terribly interesting. However, the performance picture turned out less
> straightforward than I had hoped, most likely due to loss of
> work-conservation from the scheduler in high fan-out scenarios. I'll
> describe it in this cover letter. Please read on.

So my reaction here is that I think your benchmarking was about
throughput, but the recent changes that triggered this discussion were
about latency for random small stuff.
Maybe your "LOW" tests might eb close to that, but looking at that fio
benchmark line you quoted, I don't think so.

IOW, I think that what the fsverity code ended up seeing was literally
*serial* IO that was fast enough that it was better done on the local
CPU immediately, and that that was why it wanted to remove WQ_UNBOUND.
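
Just to make that concrete, here's a rough sketch (not the actual
fsverity code, and the names here are made up) of the kind of flag
change involved at alloc_workqueue() time:

/*
 * Sketch only: drop WQ_UNBOUND and queued work runs in the per-CPU
 * worker pool of the CPU that queued it, instead of wherever the
 * scheduler decides to place an unbound worker.
 */
#include <linux/module.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

static struct workqueue_struct *verify_wq;
static struct work_struct verify_work;

static void verify_fn(struct work_struct *work)
{
	/* raw_ variant because work items run preemptible */
	pr_info("verify work ran on CPU %d\n", raw_smp_processor_id());
}

static int __init wq_demo_init(void)
{
	/*
	 * Unbound (scheduler picks the CPU, fine for big parallel batches):
	 *     verify_wq = alloc_workqueue("verify", WQ_UNBOUND | WQ_HIGHPRI, 0);
	 * Bound, which is what removing WQ_UNBOUND gives you: the work
	 * stays on the queueing CPU, cache-hot, with no wakeup migration.
	 */
	verify_wq = alloc_workqueue("verify", WQ_HIGHPRI, 0);
	if (!verify_wq)
		return -ENOMEM;

	INIT_WORK(&verify_work, verify_fn);
	queue_work(verify_wq, &verify_work);
	return 0;
}

static void __exit wq_demo_exit(void)
{
	destroy_workqueue(verify_wq);
}

module_init(wq_demo_init);
module_exit(wq_demo_exit);
MODULE_LICENSE("GPL");

For a tiny synchronous verification step, that cache-hot same-CPU
execution is exactly what you want.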

IOW, I think you should go even lower than your "LOW", and test
basically "--iodepth=1" to a ramdisk: a load where scheduling to any
other CPU is literally *always* a mistake, because the IO is basically
entirely synchronous, and it's better to just do the work on the same
CPU and be done with it.
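
And in case the fio incantation obscures what that means, the load I'm
talking about is literally just this kind of loop (sketch only: the
path and sizes are made up, point it at a file on a ramdisk/tmpfs):

/*
 * The "--iodepth=1" pattern: one thread, one outstanding 4k read at a
 * time, each read waits for the previous one to finish.  The file path
 * is made up; use a file of at least 4 MiB on a ramdisk/tmpfs so the
 * device itself is never the bottleneck.
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define BS	4096
#define NREADS	(1L << 18)

int main(void)
{
	char buf[BS];
	struct timespec t0, t1;
	int fd = open("/mnt/ramdisk/testfile", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < NREADS; i++) {
		/*
		 * Entirely synchronous: nothing else is in flight, so any
		 * time spent bouncing the work to another CPU shows up
		 * directly as added latency.
		 */
		if (pread(fd, buf, BS, (i % 1024) * BS) != BS) {
			perror("pread");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%ld reads, %.2f usec/read\n", NREADS, secs * 1e6 / NREADS);
	close(fd);
	return 0;
}

There's never more than one read in flight, so there is nothing for a
second CPU to usefully do.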

That may sound like an outlier thing, but I don't think it's
necessarily even all that odd. I think that "depth=1" is likely the
common case for many real loads.

That commit f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity
read workqueue") really talks about startup costs. They are about
things like "page in the executable", which is all almost 100%
serialized with no parallelism at all. Even read-ahead ends up being
serial, in that it's likely one single contiguous IO.

Yes, latency tends to be harder to benchmark than throughput, but I
really think latency trumps throughput 95% of the time. And all your
benchmark loads looked like throughput loads to me: they just weren't
using *all* the CPU capacity you had.

Yes, writeback can have lovely throughput behavior and saturate the IO
because you have lots of parallelism. But reads are often 100% serial
for one thread, and often you don't *have* more than one thread.

So I think your "not enough work to saturate" is still ludicrously
over-the-top. You should not aim for "not enough work to saturate 24
threads". You should aim for "basically completely single-threaded".

Judging by your "CPU utilization of 60-70%", I think your "LOW" is off
by at least an order of magnitude.

Linus