linux-kernel - Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput -33.7% regression

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAGudoHFfH7d2AnWOeTNWF_BvF8qizxp+FMK6e1CyeOfTk0fLGQ@mail.gmail.com>
Date: Thu, 27 Jun 2024 18:55:02 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev, lkp@...el.com, 
	Linux Memory Management List <linux-mm@...ck.org>, Christian Brauner <brauner@...nel.org>, linux-kernel@...r.kernel.org, 
	ying.huang@...el.com, feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput
 -33.7% regression

On Thu, Jun 27, 2024 at 6:32 PM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> On Thu, 27 Jun 2024 at 00:00, Mateusz Guzik <mjguzik@...il.com> wrote:
> >
> > I'm arranging access to a 128-way machine to prod, will report back
> > after I can profile on it.
>
> Note that the bigger problem is probably not the 128-way part, but the
> "two socket" part.
>

I know mate, I'm painfully aware of NUMA realities.

The bigger stuff I have intermittent access to is all multi socket,
the above is basically "need to use something bigger than my usual
trusty small core setup".

> From a quick look at the profile data, we have for self-cycles:
>
>   shell1 subtest:
>       +4.2   lockref_get_not_dead
>       +4.5   lockref_put_return
>      +16.8   native_queued_spin_lock_slowpath
>
>   getdent subtest:
>       +4.1   lockref_put_return
>       +5.7   lockref_get_not_dead
>      +68.0   native_queued_spin_lock_slowpath
>
> which means that the spin lock got much more expensive, _despite_ the
> "fast path" in lockref being more aggressive.
>
> Which in turn implies that the problem may be at least partly simply
> due to much more cacheline ping-pong. In particular, the lockref
> routines may be "fast", but they hit that one cacheline over and over
> again and have a thundering herd issue, while the queued spinlocks on
> their own actually try to avoid that for multiple CPU's.
>
> IOW, the queue in the queued spinlocks isn't primarily about fairness
> (although that is a thing), it's about not having all CPU cores
> accessing the same spinlock cacheline.
>
> Note that a lot of the other numbers seem "incidental". For example,
> for the getdents subtest we have a lot of the numbers going down by
> ~55%, but while that looks like a big change, it's actually just a
> direct result of this:
>
>      -56.5%  stress-ng.getdent.ops
>
> iow, the benchmark fundamentally did 56% less work.
>
> IOW, I think this may all be fundamental, and we just can't do the
> "wait for spinlock" thing, because that whole loop with a cpu_relax()
> is just deadly.
>
> And we've seen those kinds of busy-loops be huge problems before. When
> you have this kind of busy-loop:
>
>     old.lock_count = READ_ONCE(lockref->lock_count);
>     do {
>         if (lockref_locked(old)) {
>             cpu_relax();
>             old.lock_count = READ_ONCE(lockref->lock_count);
>             continue;
>         }
>
> the "cpu_relax()" is horrendously expensive, but not having it is not
> really an option either, since it will just cause a tight core-only
> loop.
>
> I suspect removing the cpu_relax() would help performance, but I
> suspect the main help would come from it effectively cutting down the
> wait cycles to practically nothing.
>

As far as lockref goes I had 2 ideas to test:
1. cpu_relax more than once, backoff style. Note the the total spin
count would still be bound before the routine gives up and locks. this
should reduce pulling
2. check how many spins would be needed to wait for before lockref
decides to fall back to locking

When messing around not automatically taking the lock I measured spin
counts with an artificially high limit. It was something like < 300 to
cover the cases I ran into (not this benchmark). For all I know the
limit can be bumped to -- say 256 -- and numerous acquires will
disappear at that scale, which should be good enough for
everybody(tm). All while forward progress is guaranteed.

That aside even that getdent thing uses mutexes and there is something
odd going on there. I verified workers go to sleep when it is
avoidable, which disfigures the result. I don't know the scale of it,
it may be it is a tiny fraction of consumers, but I do see them on
offcpu tracing.

That's that for handwaving. I'm going to get the hw and get hard
profiling data + results.

-- 
Mateusz Guzik <mjguzik gmail.com>