lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 27 Jun 2024 09:32:01 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev, lkp@...el.com, 
	Linux Memory Management List <linux-mm@...ck.org>, Christian Brauner <brauner@...nel.org>, linux-kernel@...r.kernel.org, 
	ying.huang@...el.com, feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput
 -33.7% regression

On Thu, 27 Jun 2024 at 00:00, Mateusz Guzik <mjguzik@...il.com> wrote:
>
> I'm arranging access to a 128-way machine to prod, will report back
> after I can profile on it.

Note that the bigger problem is probably not the 128-way part, but the
"two socket" part.

>From a quick look at the profile data, we have for self-cycles:

  shell1 subtest:
      +4.2   lockref_get_not_dead
      +4.5   lockref_put_return
     +16.8   native_queued_spin_lock_slowpath

  getdent subtest:
      +4.1   lockref_put_return
      +5.7   lockref_get_not_dead
     +68.0   native_queued_spin_lock_slowpath

which means that the spin lock got much more expensive, _despite_ the
"fast path" in lockref being more aggressive.

Which in turn implies that the problem may be at least partly simply
due to much more cacheline ping-pong. In particular, the lockref
routines may be "fast", but they hit that one cacheline over and over
again and have a thundering herd issue, while the queued spinlocks on
their own actually try to avoid that for multiple CPU's.

IOW, the queue in the queued spinlocks isn't primarily about fairness
(although that is a thing), it's about not having all CPU cores
accessing the same spinlock cacheline.

Note that a lot of the other numbers seem "incidental". For example,
for the getdents subtest we have a lot of the numbers going down by
~55%, but while that looks like a big change, it's actually just a
direct result of this:

     -56.5%  stress-ng.getdent.ops

iow, the benchmark fundamentally did 56% less work.

IOW, I think this may all be fundamental, and we just can't do the
"wait for spinlock" thing, because that whole loop with a cpu_relax()
is just deadly.

And we've seen those kinds of busy-loops be huge problems before. When
you have this kind of busy-loop:

    old.lock_count = READ_ONCE(lockref->lock_count);
    do {
        if (lockref_locked(old)) {
            cpu_relax();
            old.lock_count = READ_ONCE(lockref->lock_count);
            continue;
        }

the "cpu_relax()" is horrendously expensive, but not having it is not
really an option either, since it will just cause a tight core-only
loop.

I suspect removing the cpu_relax() would help performance, but I
suspect the main help would come from it effectively cutting down the
wait cycles to practically nothing.

               Linus

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ