Date:   Fri, 26 Aug 2022 09:45:47 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Mikulas Patocka <mpatocka@...hat.com>
Cc:     Alan Stern <stern@...land.harvard.edu>,
        Andrea Parri <parri.andrea@...il.com>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Boqun Feng <boqun.feng@...il.com>,
        Nicholas Piggin <npiggin@...il.com>,
        David Howells <dhowells@...hat.com>,
        Jade Alglave <j.alglave@....ac.uk>,
        Luc Maranget <luc.maranget@...ia.fr>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Akira Yokosawa <akiyks@...il.com>,
        Daniel Lustig <dlustig@...dia.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org
Subject: Re: [PATCH v3] wait_on_bit: add an acquire memory barrier

On Fri, Aug 26, 2022 at 6:17 AM Mikulas Patocka <mpatocka@...hat.com> wrote:
>
> I wouldn't do this for regular test_bit because if you read memory with
> different size/alignment from what you wrote, various CPUs suffer from
> store->load forwarding penalties.

All of the half-way modern CPUs do ok with store->load forwarding as
long as the load is fully contained in the store, so I suspect the
'testb' model is pretty much universally better than loading a word
into a register and testing it there.
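
Something like this (an untested C sketch, not the actual x86
implementation; the helper name here is made up):

        /*
         * Byte-granular test: load only the byte that contains the
         * bit.  Any earlier store that wrote this byte fully contains
         * the one-byte load, so the store buffer can forward it; on
         * x86 the compiler can turn this into a single 'testb' with
         * a constant offset.
         */
        static inline bool byte_test_bit(unsigned long nr,
                                         const volatile unsigned long *addr)
        {
                const volatile unsigned char *p =
                        (const volatile unsigned char *)addr + (nr >> 3);

                return (*p >> (nr & 7)) & 1;
        }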

So narrowing the load is fine (but you generally never want to narrow
a *store*, because that results in huge problems with subsequent wider
loads).
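
The bad direction looks something like this (made-up, untested
example; take_slow_path() is just a stand-in):

        extern void take_slow_path(void);

        static void narrow_store_wide_load(unsigned long *flags)
        {
                *(volatile unsigned char *)flags |= 1;  /* narrow byte store */

                if (*flags & 1)                         /* wider word load */
                        take_slow_path();
                /*
                 * The word load overlaps the byte store but is not
                 * contained in it, so store->load forwarding fails and
                 * the load stalls until the store has retired.
                 */
        }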

But it's not a huge deal, and this way if somebody actually runs the
numbers and does any comparisons, we have both versions, and if the
'testb' is better, we can just rename the x86
constant_test_bit_acquire() to just constant_test_bit() and use it for
both cases.

> But for test_bit_acquire this optimization is likely harmless because the
> bit will not be tested a few instructions after writing it.

Note that if code really does test the bit just a few instructions
after writing it, then we've already lost because of the volatile
access, i.e. if we cared about a "write bit, test bit" pattern, we
should use other operations.
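
For that pattern you'd want the combined atomic RMW instead,
something like this (untested sketch; claim_slot() is made up):

        #include <linux/bitops.h>
        #include <linux/errno.h>

        static int claim_slot(unsigned long *bitmap, unsigned int nr)
        {
                /*
                 * One atomic read-modify-write: test_and_set_bit()
                 * returns the old bit value, so there is no separate
                 * store followed by a reload of the same word.
                 */
                if (test_and_set_bit(nr, bitmap))
                        return -EBUSY;  /* already taken */
                return 0;
        }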

Now, the new "const_test_bit()" logic (commits bb7379bfa680 "bitops:
define const_*() versions of the non-atomics" and 0e862838f290
"bitops: unify non-atomic bitops prototypes across architectures")
means that as long as you are setting and testing a bit in a local
variable, it gets elided entirely. But in general use you're going to
see that load from memory, and then the wider load is likely worse
(because bigger constants, and because it requires a register).
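
E.g. this made-up case is the one that gets elided entirely (untested,
but it's exactly the local-variable situation those commits cover):

        #include <linux/bitops.h>

        static bool local_bitmap_case(void)
        {
                unsigned long flags = 0;

                __set_bit(3, &flags);
                /*
                 * The bit number and the pointed-to value are both
                 * known at compile time, so this resolves through
                 * const_test_bit() and the compiler can fold the whole
                 * function down to 'return true': no load, no big
                 * constant, no extra register.
                 */
                return test_bit(3, &flags);
        }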

So maybe there is room to tweak it further, but this version of the
patch looks good to me, and I've applied it.

Thanks,
                Linus
