Message-ID: <CAHk-=whKmuranma_HUOKBbDJHGmdWZr9MYW-+cmGzsOiJ2N1Sg@mail.gmail.com>
Date:   Thu, 14 Feb 2019 09:51:21 -0800
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Waiman Long <longman@...hat.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Will Deacon <will.deacon@....com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        "linux-alpha@...r.kernel.org" <linux-alpha@...r.kernel.org>,
        "linux-alpha@...r.kernel.org" <linux-arm-kernel@...ts.infradead.org>,
        linux-hexagon@...r.kernel.org, linux-ia64@...r.kernel.org,
        linuxppc-dev@...ts.ozlabs.org,
        Linux-sh list <linux-sh@...r.kernel.org>,
        sparclinux@...r.kernel.org, linux-xtensa@...ux-xtensa.org,
        linux-arch <linux-arch@...r.kernel.org>,
        "the arch/x86 maintainers" <x86@...nel.org>,
        Arnd Bergmann <arnd@...db.de>, Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>
Subject: Re: [PATCH v3 2/2] locking/rwsem: Optimize down_read_trylock()

On Thu, Feb 14, 2019 at 6:53 AM Waiman Long <longman@...hat.com> wrote:
>
> The ARM64 result is what I would have expected given that the change was
> to optimize for the uncontended case. The x86-64 result is kind of an
> anomaly to me, but I haven't bothered to dig into that.

I would say that the ARM result is what I'd expect from something that
scales badly to begin with.

The x86-64 result is the expected one: yes, the cmpxchg is done one
extra time, but it results in fewer cache transitions (the cacheline
never goes into "shared" state), and cache transitions are what
matter.
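
[To make that concrete, here is a minimal userspace C11 sketch of the
two trylock patterns being compared; the names and the bias constant
are illustrative, not the kernel's actual rwsem code:]

#include <stdatomic.h>
#include <stdbool.h>

#define READER_BIAS 1L          /* illustrative; the real rwsem bias differs */

/* Old pattern: plain load first, then cmpxchg from the observed value.
 * When other CPUs have the line cached, the load pulls it in shared
 * state, and the cmpxchg must then upgrade it to exclusive. */
static bool trylock_read_then_cas(atomic_long *count)
{
        long old;

        while ((old = atomic_load(count)) >= 0) {
                if (atomic_compare_exchange_weak(count, &old,
                                                 old + READER_BIAS))
                        return true;
        }
        return false;
}

/* New pattern: guess "unlocked" (count == 0) and cmpxchg immediately.
 * If the guess is wrong, 'old' now holds the value the cmpxchg saw and
 * the loop retries -- the "one extra time" above -- but the very first
 * access already requested the line for exclusive ownership. */
static bool trylock_cas_first(atomic_long *count)
{
        long old = 0;           /* assume the uncontended case */

        do {
                if (atomic_compare_exchange_weak(count, &old,
                                                 old + READER_BIAS))
                        return true;
        } while (old >= 0);
        return false;
}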

The cost of re-doing the instruction should be low. The cacheline
ping-pong and the cache coherency messages are what hurt.
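
[Spelled out for a MESI-style protocol, with another CPU already
caching the line -- simplified, but the message count is the point:]

  read-then-cmpxchg:
        load  count      line arrives in Shared state (others keep copies)
        cmpxchg count    upgrade to Modified: invalidate every other copy
                         -> two coherency transactions, plus a window
                            between them where the line can be stolen

  cmpxchg-first:
        cmpxchg count    one request-for-ownership: line arrives already
                         exclusive, other copies invalidated in one step
                         -> one coherency transaction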

So I actually think both are very easily explained.

The x86-64 number improves, because there is less cache coherency traffic.

The arm64 numbers scaled horribly even before, and that's because
there is too much ping-pong. That is probably because there is no
"stickiness" of the cacheline to the core, so adding the extra loop
can make the ping-pong issue even worse, because now there is more of
it.

The cachelines not sticking to a core at all is probably good for
fairness (in particular, sticking *too* much can cause horrible
problems), but it's absolutely horrible if it means that you lose the
cacheline even before you get to complete the second cmpxchg.

                  Linus
