lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  PHC 
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 1 May 2019 17:01:40 +0100
From:   Will Deacon <>
To:     Jan Glauber <>
Cc:     "" <>,
        "" <>,
        Jayachandran Chandrasekharan Nair <>,,
Subject: Re: [RFC] Disable lockref on arm64

Hi Jan,

[+Peter and Linus, since they enjoy this stuff]

On Mon, Apr 29, 2019 at 02:52:11PM +0000, Jan Glauber wrote:
> I've been looking into performance issues that were reported for several
> test-cases, for instance an nginx benchmark.

Could you share enough specifics here so that we can reproduce the issue
locally, please? That would help us in our attempts to develop a fix without
simply disabling the option for everybody else.

> It turned out the issue we have on ThunderX2 is the file open-close sequence
> with small read sizes. If the used files are opened read-only the
> lockref code (enabled by ARCH_USE_CMPXCHG_LOCKREF) is used.
> The lockref CMPXCHG_LOOP uses an unbound (as long as the associated
> spinlock isn't taken) while loop to change the lock count. This behaves
> badly under heavy contention (~25x retries for one cmpxchg to succeed
> with 28 threads operating on the same file). In case of a NUMA system
> it also behaves badly as the access from the other socket is much slower.

It's surprising that this hasn't been reported on x86. I suspect their
implementation of cmpxchg is a little more forgiving under contention.

> The fact that on ThunderX2 cpu_relax() turns only into one NOP
> instruction doesn't help either. On Intel pause seems to block the thread
> much longer, avoiding the heavy contention thereby.

NOPing out the yield instruction seems like a poor choice for an SMT CPU
such as TX2. That said, the yield was originally added to cpu_relax() as
a scheduling hint for QEMU.

> With the queued spinlocks implementation I can see a major improvement
> when I disable lockref. A trivial open-close test-case improves by
> factor 2 while system time is decreasing also 2x. Looking at kernel compile
> and dbench numbers didn't show any regression with lockref disabled.
> Can we simply disable lockref? Is anyone else seeing this issue? Is there
> an arm64 platform that actually implements yield?

There are two issues with disabling lockref like this:

  1. It's a compile-time thing, so systems that would benefit from the code
     are unfairly penalised.

  2. You're optimising for the contended case at the cost of the
     uncontended case, which should actually be the common case as well.

Now, nobody expects contended CAS to scale well, so the middle ground
probably involves backing off to the lock under contention, a bit like
an optimistic trylock(). Unfortunately, that will need some tuning, hence
my initial request for a reproducer.



Powered by blists - more mailing lists