linux-kernel - Re: [PATCH v2] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <11b27d32-4a8a-3f84-0f25-723095ef1076@nvidia.com>
Date:   Thu, 12 Jul 2018 19:05:39 -0700
From:   Daniel Lustig <dlustig@...dia.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>
CC:     Paul McKenney <paulmck@...ux.vnet.ibm.com>,
        Alan Stern <stern@...land.harvard.edu>,
        "andrea.parri@...rulasolutions.com" 
        <andrea.parri@...rulasolutions.com>,
        Will Deacon <will.deacon@....com>,
        Akira Yokosawa <akiyks@...il.com>,
        Boqun Feng <boqun.feng@...il.com>,
        David Howells <dhowells@...hat.com>,
        Jade Alglave <j.alglave@....ac.uk>,
        Luc Maranget <luc.maranget@...ia.fr>,
        Nick Piggin <npiggin@...il.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] tools/memory-model: Add extra ordering for locks and
 remove it for ordinary release/acquire

On 7/12/2018 11:10 AM, Linus Torvalds wrote:
> On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <peterz@...radead.org> wrote:
>>
>> The locking pattern is fairly simple and shows where RCpc comes apart
>> from expectation real nice.
> 
> So who does RCpc right now for the unlock-lock sequence? Somebody
> mentioned powerpc. Anybody else?
> 
> How nasty would be be to make powerpc conform? I will always advocate
> tighter locking and ordering rules over looser ones..
> 
>             Linus

RISC-V probably would have been RCpc if we weren't having this discussion.
Depending on how we map atomics/acquire/release/unlock/lock, we can end up
producing RCpc, "RCtso" (feel free to find a better name here...), or RCsc
behaviors, and we're trying to figure out which we actually need.

I think the debate is this:

Obviously programmers would prefer just to have RCsc and not have to figure out
all the complexity of the other options.  On x86 or architectures with native
RCsc operations (like ARMv8), that's generally easy enough to get.

For weakly-ordered architectures that use fences for ordering (including
PowerPC and sometimes RISC-V, see below), though, it takes extra fences to go
from RCpc to either "RCtso" or RCsc.  People using these architectures are
concerned about whether there's a negative performance impact from those extra
fences.

However, some scheduler code, some RCU code, and probably some other examples
already implicitly or explicitly assume unlock()/lock() provides stronger
ordering than RCpc.  So, we have to decide whether to:
1) define unlock()/lock() to enforce "RCtso" or RCsc, insert more fences on
PowerPC and RISC-V accordingly, and probably negatively regress PowerPC
2) leave unlock()/lock() as enforcing only RCpc, fix any code that currently
assumes something stronger than RCpc is being provided, and hope people don't
get it wrong in the future
3) some mixture like having unlock()/lock() be "RCtso" but smp_store_release()/
smp_cond_load_acquire() be only RCpc

Also, FWIW, if other weakly-ordered architectures come along in the future and
also use any kind of lightweight fence rather than native RCsc operations,
they'll likely be in the same boat as RISC-V and Power here, in the sense of
not providing RCsc by default either.

Is that a fair assessment everyone?



I can also not-so-briefly summarize RISC-V's status here, since I think there's
been a bunch of confusion about where we're coming from:

First of all, I promise we're not trying to start a fight about all this :)
We're trying to understand the LKMM requirements so we know what instructions
to use.

With that, the easy case: RISC-V is RCsc if we use AMOs or load-reserved/
store-conditional, all of which have RCsc .aq and .rl bits:

  (a) ...
  amoswap.w.rl x0, x0, [lock]  // unlock()
  ...
loop:
  amoswap.w.aq a0, t1, [lock]  // lock()
  bnez a0, loop                // lock()
  (b) ...

(a) is ordered before (b) here, regardless of (a) and (b).  Likewise for our
load-reserved/store-conditional instructions, which also have .aq and rl.
That's similiar to how ARM behaves, and is no problem.  We're happy with that
too.

Unfortunately, we don't (currently?) have plain load-acquire or store-release
opcodes in the ISA.  (That's a different discussion...)  For those, we need
fences instead.  And that's where it gets messier.

RISC-V *would* end up providing only RCpc if we use what I'd argue is the most
"natural" fence-based mapping for store-release operations, and then pair that
with LR/SC:

  (a) ...
  fence rw,w     // unlock()
  sw x0, [lock]  // unlock()
  ...
loop:
  lr.w.aq a0, [lock]  // lock()
  sc.w t1, [lock]     // lock()
  bnez loop           // lock()
  (b) ...

However, if (a) and (b) are loads to different addresses, then (a) is not
ordered before (b) here.  One unpaired RCsc operation is not a full fence.
Clearly "fence rw,w" is not sufficient if the scheduler, RCU, and elsewhere
depend on "RCtso" or RCsc.

RISC-V can get back to "RCtso", matching PowerPC, by using a stronger fence:

  (a) ...
  fence.tso      // unlock(), fence.tso == fence rw,w + fence r,r
  sw x0, [lock]  // unlock()
  ...
loop:
  lr.w.aq a0, [lock]  // lock()
  sc.w t1, [lock]     // lock()
  bnez loop           // lock()
  (b) ...

(a) is ordered before (b), unless (a) is a store and (b) is a load to a
different address.

(Modeling note: this example is why I asked for Alan's v3 patch over the v2
patch, which I believe would only have worked if the fence.tso were at the end)

To get full RCsc here, we'd need a fence rw,rw in between the unlock store and
the lock load, much like PowerPC would I believe need a heavyweight sync:

  (a) ...
  fence rw,w     // unlock()
  sw x0, [lock]  // unlock()
  ...
  fence rw,rw    // can attach either to lock() or to unlock()
  ...
loop:
  lr.w.aq a0, [lock]  // lock()
  sc.w t1, [lock]     // lock()
  bnez loop           // lock()
  (b) ...

In general, RISC-V's fence.tso will suffice wherever PowerPC's lwsync does, and
RISC-V's fence rw,rw will suffice wherever PowerPC's full sync does.  If anyone
is claiming RISC-V is suddenly proposing to go weaker than all the other major
architectures, that's a mischaracterization.

All in all: if LKMM wants RCsc, we can do it, but it's not free for RISC-V (or
Power).  If LKMM wants RCtso, we can do that too, and that's in between.  If
LKMM wants RCpc, we can do that, and it's the fastest of the bunch.  No I don't
have concrete numbers either...  And RISC-V implementations are going to vary
pretty widely anyway.

Hope that helps.  Please correct anything I screwed up or mischaracterized.

Dan