linux-kernel - Re: [PATCH v2 2/2] locking/rwsem: Optimize down_read

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ef66f262-d68e-58ea-34f6-15b88803cf71@redhat.com>
Date:   Tue, 12 Feb 2019 16:21:10 -0500
From:   Waiman Long <longman@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Will Deacon <will.deacon@....com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        linux-alpha@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        linux-hexagon@...r.kernel.org, linux-ia64@...r.kernel.org,
        linuxppc-dev@...ts.ozlabs.org,
        Linux-sh list <linux-sh@...r.kernel.org>,
        sparclinux@...r.kernel.org, linux-xtensa@...ux-xtensa.org,
        linux-arch <linux-arch@...r.kernel.org>,
        the arch/x86 maintainers <x86@...nel.org>,
        Arnd Bergmann <arnd@...db.de>, Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>
Subject: Re: [PATCH v2 2/2] locking/rwsem: Optimize down_read_trylock()

On 02/12/2019 02:58 PM, Linus Torvalds wrote:
> On Mon, Feb 11, 2019 at 11:31 AM Waiman Long <longman@...hat.com> wrote:
>> Modify __down_read_trylock() to make it generate slightly better code
>> (smaller and maybe a tiny bit faster).
> This looks good, but I would ask you to try one slightly different approach.
>
> Instead of this:
>
>>        long tmp = atomic_long_read(&sem->count);
>>
>>        while (tmp >= 0) {
>>                if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
>>                                        tmp + RWSEM_ACTIVE_READ_BIAS)) {
>>                        return 1;
>>                }
>>        }
> try doing this instead:
>
>         long tmp = 0;
>
>         do {
>                 if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
>                                         tmp + RWSEM_ACTIVE_READ_BIAS)) {
>                         return 1;
>         } while (tmp >= 0);
>         return 0;
>
> because especially when it comes to locking, it's usually better to
> just *guess* that the lock is unlocked, than it is to actually read
> from the line to see what the state is.
>
> Often - but certainly not always - the lock is the first access to the
> target cacheline, and assuming the trylock is successful (which I
> think is the case we want to optimize for), we're much better off
> causing that first access to be a read-for-ownership, rather than a
> read-for-sharing.
>
> Because if you first read from the line, and then do a cmpxchg, and if
> the line was not in the cache, your cache coherency protocol will
> generally go through two states: first shared (for the initial read)
> and then exclusive-dirty (for the cmpxchg).
>
> Now, this is obviously very micro-architecture dependent, and in fact
> the microarchitecture could even see the "predict fallthrough to a
> cmpxchg with the same address" and turn the first read into a
> read-for-ownership, but we've done this at some point before, and the
> "guess unlocked" was actually the one that performed better.
>
> Of course, the downside is that it might be worse when the guess is
> incorrect - either because of a nested read lock or due to an actual
> conflict with a write, but on the whole those *should* be the rare
> cases, and not the cases where we necessarily optimize for latency of
> the operation.
>
> Hmm?

I looked at the assembly code in arch/x86/include/asm/rwsem.h. For both
trylocks (read & write), the count is read first before attempting to
lock it. We did the same for all trylock functions in other locks.
Depending on how the trylock is used and how contended the lock is, it
may help or hurt performance. Changing down_read_trylock to do an
unconditional cmpxchg will change the performance profile of existing
code. So I would prefer keeping the current code.

I do notice now that the generic down_write_trylock() code is doing an
unconditional compxchg. So I wonder if we should change it to read the
lock first like other trylocks or just leave it as it is.

Cheers,
Longman