[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wg7PXo_QbBo8gv27OpbMgAwLh9H46kJRxAmp0FL0QD7HA@mail.gmail.com>
Date: Thu, 27 Jun 2024 09:32:01 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev, lkp@...el.com,
Linux Memory Management List <linux-mm@...ck.org>, Christian Brauner <brauner@...nel.org>, linux-kernel@...r.kernel.org,
ying.huang@...el.com, feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput
-33.7% regression
On Thu, 27 Jun 2024 at 00:00, Mateusz Guzik <mjguzik@...il.com> wrote:
>
> I'm arranging access to a 128-way machine to prod, will report back
> after I can profile on it.
Note that the bigger problem is probably not the 128-way part, but the
"two socket" part.
>From a quick look at the profile data, we have for self-cycles:
shell1 subtest:
+4.2 lockref_get_not_dead
+4.5 lockref_put_return
+16.8 native_queued_spin_lock_slowpath
getdent subtest:
+4.1 lockref_put_return
+5.7 lockref_get_not_dead
+68.0 native_queued_spin_lock_slowpath
which means that the spin lock got much more expensive, _despite_ the
"fast path" in lockref being more aggressive.
Which in turn implies that the problem may be at least partly simply
due to much more cacheline ping-pong. In particular, the lockref
routines may be "fast", but they hit that one cacheline over and over
again and have a thundering herd issue, while the queued spinlocks on
their own actually try to avoid that for multiple CPU's.
IOW, the queue in the queued spinlocks isn't primarily about fairness
(although that is a thing), it's about not having all CPU cores
accessing the same spinlock cacheline.
Note that a lot of the other numbers seem "incidental". For example,
for the getdents subtest we have a lot of the numbers going down by
~55%, but while that looks like a big change, it's actually just a
direct result of this:
-56.5% stress-ng.getdent.ops
iow, the benchmark fundamentally did 56% less work.
IOW, I think this may all be fundamental, and we just can't do the
"wait for spinlock" thing, because that whole loop with a cpu_relax()
is just deadly.
And we've seen those kinds of busy-loops be huge problems before. When
you have this kind of busy-loop:
old.lock_count = READ_ONCE(lockref->lock_count);
do {
if (lockref_locked(old)) {
cpu_relax();
old.lock_count = READ_ONCE(lockref->lock_count);
continue;
}
the "cpu_relax()" is horrendously expensive, but not having it is not
really an option either, since it will just cause a tight core-only
loop.
I suspect removing the cpu_relax() would help performance, but I
suspect the main help would come from it effectively cutting down the
wait cycles to practically nothing.
Linus
Powered by blists - more mailing lists