Message-ID: <20180713090637.GA10601@andrea>
Date: Fri, 13 Jul 2018 11:07:11 +0200
From: Andrea Parri <andrea.parri@...rulasolutions.com>
To: Daniel Lustig <dlustig@...dia.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>,
Paul McKenney <paulmck@...ux.vnet.ibm.com>,
Alan Stern <stern@...land.harvard.edu>,
Will Deacon <will.deacon@....com>,
Akira Yokosawa <akiyks@...il.com>,
Boqun Feng <boqun.feng@...il.com>,
David Howells <dhowells@...hat.com>,
Jade Alglave <j.alglave@....ac.uk>,
Luc Maranget <luc.maranget@...ia.fr>,
Nick Piggin <npiggin@...il.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] tools/memory-model: Add extra ordering for locks and
remove it for ordinary release/acquire
On Thu, Jul 12, 2018 at 07:05:39PM -0700, Daniel Lustig wrote:
> On 7/12/2018 11:10 AM, Linus Torvalds wrote:
> > On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra <peterz@...radead.org> wrote:
> >>
> >> The locking pattern is fairly simple and shows where RCpc comes apart
> >> from expectation real nice.
> >
> > So who does RCpc right now for the unlock-lock sequence? Somebody
> > mentioned powerpc. Anybody else?
> >
> > How nasty would it be to make powerpc conform? I will always advocate
> > tighter locking and ordering rules over looser ones..
> >
> > Linus
>
> RISC-V probably would have been RCpc if we weren't having this discussion.
> Depending on how we map atomics/acquire/release/unlock/lock, we can end up
> producing RCpc, "RCtso" (feel free to find a better name here...), or RCsc
> behaviors, and we're trying to figure out which we actually need.
>
> I think the debate is this:
>
> Obviously programmers would prefer just to have RCsc and not have to figure out
> all the complexity of the other options. On x86 or architectures with native
> RCsc operations (like ARMv8), that's generally easy enough to get.
>
> For weakly-ordered architectures that use fences for ordering (including
> PowerPC and sometimes RISC-V, see below), though, it takes extra fences to go
> from RCpc to either "RCtso" or RCsc. People using these architectures are
> concerned about whether there's a negative performance impact from those extra
> fences.
>
> However, some scheduler code, some RCU code, and probably some other examples
> already implicitly or explicitly assume unlock()/lock() provides stronger
> ordering than RCpc. So, we have to decide whether to:
> 1) define unlock()/lock() to enforce "RCtso" or RCsc, insert more fences on
> PowerPC and RISC-V accordingly, and probably regress performance on PowerPC
> 2) leave unlock()/lock() as enforcing only RCpc, fix any code that currently
> assumes something stronger than RCpc is being provided, and hope people don't
> get it wrong in the future
> 3) some mixture like having unlock()/lock() be "RCtso" but smp_store_release()/
> smp_cond_load_acquire() be only RCpc
>
> Also, FWIW, if other weakly-ordered architectures come along in the future and
> also use any kind of lightweight fence rather than native RCsc operations,
> they'll likely be in the same boat as RISC-V and Power here, in the sense of
> not providing RCsc by default either.
>
> Is that a fair assessment, everyone?
It is for me, thank you! And as we've seen, there are arguments for each of
the above three choices. I'm afraid that (despite Linus's statement ;-)),
my preference would currently go to (2).
>
>
>
> I can also not-so-briefly summarize RISC-V's status here, since I think there's
> been a bunch of confusion about where we're coming from:
>
> First of all, I promise we're not trying to start a fight about all this :)
> We're trying to understand the LKMM requirements so we know what instructions
> to use.
>
> With that, the easy case: RISC-V is RCsc if we use AMOs or load-reserved/
> store-conditional, all of which have RCsc .aq and .rl bits:
>
> (a) ...
> amoswap.w.rl x0, x0, [lock] // unlock()
> ...
> loop:
> amoswap.w.aq a0, t1, [lock] // lock()
> bnez a0, loop // lock()
> (b) ...
>
> (a) is ordered before (b) here, regardless of whether (a) and (b) are loads
> or stores. Likewise for our load-reserved/store-conditional instructions,
> which also have .aq and .rl. That's similar to how ARM behaves, and is no
> problem. We're happy with that too.
>
> Unfortunately, we don't (currently?) have plain load-acquire or store-release
> opcodes in the ISA. (That's a different discussion...) For those, we need
> fences instead. And that's where it gets messier.
>
> RISC-V *would* end up providing only RCpc if we use what I'd argue is the most
> "natural" fence-based mapping for store-release operations, and then pair that
> with LR/SC:
>
> (a) ...
> fence rw,w // unlock()
> sw x0, [lock] // unlock()
> ...
> loop:
> lr.w.aq a0, [lock] // lock()
> sc.w t1, [lock] // lock()
> bnez loop // lock()
> (b) ...
>
> However, if (a) and (b) are loads to different addresses, then (a) is not
> ordered before (b) here. One unpaired RCsc operation is not a full fence.
> Clearly "fence rw,w" is not sufficient if the scheduler, RCU, and elsewhere
> depend on "RCtso" or RCsc.
>
> RISC-V can get back to "RCtso", matching PowerPC, by using a stronger fence:
Or by using a "fence r,rw" in the lock() (without the .aq), as current code
does ;-) though I'm not sure how the current solution would compare to the
.tso mapping...
Andrea
>
> (a) ...
> fence.tso // unlock(), fence.tso == fence rw,w + fence r,r
> sw x0, [lock] // unlock()
> ...
> loop:
> lr.w.aq a0, [lock] // lock()
> sc.w t1, [lock] // lock()
> bnez loop // lock()
> (b) ...
>
> (a) is ordered before (b), unless (a) is a store and (b) is a load to a
> different address.
>
> (Modeling note: this example is why I asked for Alan's v3 patch over the v2
> patch, which I believe would only have worked if the fence.tso were at the end)
>
> To get full RCsc here, we'd need a fence rw,rw in between the unlock store and
> the lock load, much like PowerPC would I believe need a heavyweight sync:
>
> (a) ...
> fence rw,w // unlock()
> sw x0, [lock] // unlock()
> ...
> fence rw,rw // can attach either to lock() or to unlock()
> ...
> loop:
> lr.w.aq a0, [lock] // lock()
> sc.w t1, [lock] // lock()
> bnez loop // lock()
> (b) ...
>
> In general, RISC-V's fence.tso will suffice wherever PowerPC's lwsync does, and
> RISC-V's fence rw,rw will suffice wherever PowerPC's full sync does. If anyone
> is claiming RISC-V is suddenly proposing to go weaker than all the other major
> architectures, that's a mischaracterization.
>
> All in all: if LKMM wants RCsc, we can do it, but it's not free for RISC-V (or
> Power). If LKMM wants RCtso, we can do that too, and that's in between. If
> LKMM wants RCpc, we can do that, and it's the fastest of the bunch. No, I don't
> have concrete numbers either... And RISC-V implementations are going to vary
> pretty widely anyway.
>
> Hope that helps. Please correct anything I screwed up or mischaracterized.
>
> Dan