Message-ID: <20190501160140.GC28109@fuggles.cambridge.arm.com>
Date: Wed, 1 May 2019 17:01:40 +0100
From: Will Deacon <will.deacon@....com>
To: Jan Glauber <jglauber@...vell.com>
Cc: "catalin.marinas@....com" <catalin.marinas@....com>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Jayachandran Chandrasekharan Nair <jnair@...vell.com>,
peterz@...radead.org, torvalds@...ux-foundation.org
Subject: Re: [RFC] Disable lockref on arm64
Hi Jan,
[+Peter and Linus, since they enjoy this stuff]
On Mon, Apr 29, 2019 at 02:52:11PM +0000, Jan Glauber wrote:
> I've been looking into performance issues that were reported for several
> test-cases, for instance an nginx benchmark.
Could you share enough specifics here so that we can reproduce the issue
locally, please? That would help us in our attempts to develop a fix without
simply disabling the option for everybody else.
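In the meantime, is the test roughly of the shape below? This is only my
guess at it (the file name, thread count and read size are invented), but
it matches your description of many threads doing open/small read/close
on the same read-only file:

	/*
	 * Guessed reproducer (not the original benchmark): NTHREADS
	 * threads repeatedly open, read a little from, and close the
	 * same read-only file, which hammers the dentry's d_lockref.
	 */
	#include <fcntl.h>
	#include <pthread.h>
	#include <unistd.h>

	#define NTHREADS	28
	#define ITERATIONS	100000

	static void *worker(void *arg)
	{
		char buf[64];
		int i, fd;

		for (i = 0; i < ITERATIONS; i++) {
			fd = open("/tmp/testfile", O_RDONLY);
			if (fd < 0)
				return NULL;
			read(fd, buf, sizeof(buf));	/* small read */
			close(fd);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t threads[NTHREADS];
		int i;

		for (i = 0; i < NTHREADS; i++)
			pthread_create(&threads[i], NULL, worker, NULL);
		for (i = 0; i < NTHREADS; i++)
			pthread_join(threads[i], NULL);
		return 0;
	}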
> It turned out that the issue we see on ThunderX2 is the file open-close
> sequence with small read sizes. If the files are opened read-only, the
> lockref code (enabled by ARCH_USE_CMPXCHG_LOCKREF) is used.
>
> The lockref CMPXCHG_LOOP uses an unbounded (as long as the associated
> spinlock isn't taken) while loop to change the lock count. This behaves
> badly under heavy contention (~25 retries for one cmpxchg to succeed
> with 28 threads operating on the same file). On a NUMA system it also
> behaves badly, as accesses from the other socket are much slower.
It's surprising that this hasn't been reported on x86. I suspect their
implementation of cmpxchg is a little more forgiving under contention.
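For those following along, the loop in question boils down to the pattern
below. This is a simplified sketch of lib/lockref.c's CMPXCHG_LOOP rather
than the actual macro, and the function name is invented for the example:

	static bool lockref_get_cmpxchg(struct lockref *lockref)
	{
		struct lockref old;

		old.lock_count = READ_ONCE(lockref->lock_count);
		while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
			struct lockref new = old;
			u64 prev = old.lock_count;

			new.count++;
			old.lock_count = cmpxchg64_relaxed(&lockref->lock_count,
							   prev, new.lock_count);
			if (old.lock_count == prev)
				return true;	/* cmpxchg won the race */
			cpu_relax();		/* a single NOP on TX2 */
		}
		return false;	/* lock held: caller takes the spinlock */
	}

The problem, as I understand it, is that nothing bounds the number of
times we go round that loop while the lock is observed to be free.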
> The fact that on ThunderX2 cpu_relax() turns into only one NOP
> instruction doesn't help either. On Intel, pause seems to stall the
> thread for much longer, thereby avoiding the heavy contention.
NOPing out the yield instruction seems like a poor choice for an SMT CPU
such as TX2. That said, the yield was originally added to cpu_relax() as
a scheduling hint for QEMU.
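For reference, arm64's cpu_relax() is (modulo details) just:

	static inline void cpu_relax(void)
	{
		asm volatile("yield" ::: "memory");
	}

and the architecture allows YIELD to be treated as a NOP, which appears
to be exactly what TX2 does.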
> With the queued spinlock implementation I can see a major improvement
> when I disable lockref. A trivial open-close test-case improves by a
> factor of 2, while system time also drops by 2x. Kernel compile and
> dbench numbers didn't show any regression with lockref disabled.
>
> Can we simply disable lockref? Is anyone else seeing this issue? Is there
> an arm64 platform that actually implements yield?
There are two issues with disabling lockref like this:
1. It's a compile-time thing, so systems that would benefit from the code
are unfairly penalised.
2. You're optimising for the contended case at the cost of the
uncontended case, which should actually be the common case as well.
Now, nobody expects contended CAS to scale well, so the middle ground
probably involves backing off to the lock under contention, a bit like
an optimistic trylock(). Unfortunately, that will need some tuning, hence
my initial request for a reproducer.
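By way of illustration only, I'm thinking of something along these lines
(completely untested; the retry limit and the function name are made up,
and the limit is precisely the bit that needs tuning against a real
workload):

	static void lockref_get_with_backoff(struct lockref *lockref)
	{
		struct lockref old;
		int retries = 16;	/* made-up value: needs tuning */

		old.lock_count = READ_ONCE(lockref->lock_count);
		while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
			struct lockref new = old;
			u64 prev = old.lock_count;

			new.count++;
			old.lock_count = cmpxchg64_relaxed(&lockref->lock_count,
							   prev, new.lock_count);
			if (old.lock_count == prev)
				return;		/* uncontended fast path */
			if (!--retries)
				break;		/* contended: stop spinning on CAS */
			cpu_relax();
		}

		/* Back off to the lock, as the existing slow path does. */
		spin_lock(&lockref->lock);
		lockref->count++;
		spin_unlock(&lockref->lock);
	}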
Cheers,
Will