linux-kernel - Re: [v3,11/41] mips: reuse asm-generic/barrier.h

Message-ID: <20160126061211.GK4503@linux.vnet.ibm.com>
Date:	Mon, 25 Jan 2016 22:12:11 -0800
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Will Deacon <will.deacon@....com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Leonid Yegoshin <Leonid.Yegoshin@...tec.com>,
	"Michael S. Tsirkin" <mst@...hat.com>,
	linux-kernel@...r.kernel.org, Arnd Bergmann <arnd@...db.de>,
	linux-arch@...r.kernel.org,
	Andrew Cooper <andrew.cooper3@...rix.com>,
	Russell King - ARM Linux <linux@....linux.org.uk>,
	virtualization@...ts.linux-foundation.org,
	Stefano Stabellini <stefano.stabellini@...citrix.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...e.hu>, "H. Peter Anvin" <hpa@...or.com>,
	Joe Perches <joe@...ches.com>,
	David Miller <davem@...emloft.net>, linux-ia64@...r.kernel.org,
	linuxppc-dev@...ts.ozlabs.org, linux-s390@...r.kernel.org,
	sparclinux@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
	linux-metag@...r.kernel.org, linux-mips@...ux-mips.org,
	x86@...nel.org, user-mode-linux-devel@...ts.sourceforge.net,
	adi-buildroot-devel@...ts.sourceforge.net,
	linux-sh@...r.kernel.org, linux-xtensa@...ux-xtensa.org,
	xen-devel@...ts.xenproject.org, Ralf Baechle <ralf@...ux-mips.org>,
	Ingo Molnar <mingo@...nel.org>, ddaney.cavm@...il.com,
	james.hogan@...tec.com, Michael Ellerman <mpe@...erman.id.au>
Subject: Re: [v3,11/41] mips: reuse asm-generic/barrier.h

On Mon, Jan 25, 2016 at 06:02:34PM +0000, Will Deacon wrote:
> Hi Paul,
> 
> On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:
> > > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > > > So smp_mb() provides transitivity, as do pairs of smp_store_release()
> > > > and smp_load_acquire(),
> > > 
> > > But they provide different grades of transitivity, which is where all
> > > the confusion lies.
> > > 
> > > smp_mb() is strongly/globally transitive, all CPUs will agree on the order.
> > > 
> > > Whereas the RCpc release+acquire is weakly so, only the two CPUs
> > > involved in the handover will agree on the order.
> > 
> > Good point!
> > 
> > Using grace periods in place of smp_mb() also provides strong/global
> > transitivity, but also insanely high latencies.  ;-)
> > 
> > The patch below updates Documentation/memory-barriers.txt to define
> > local vs. global transitivity.  The corresponding ppcmem litmus test
> > is included below as well.
> > 
> > Should we start putting litmus tests for the various examples
> > somewhere, perhaps in a litmus-tests directory within each participating
> > architecture?  I have a pile of powerpc-related litmus tests on my laptop,
> > but they probably aren't doing all that much good there.
> 
> I too would like to have the litmus tests in the kernel so that we can
> refer to them from memory-barriers.txt. Ideally they wouldn't be targeted
> to a particular arch, however.

Agreed.  Working on it...

> > PPC local-transitive
> > ""
> > {
> > 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z;
> > 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z;
> > 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z;
> > 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z;
> > }
> >  P0           | P1           | P2           | P3           ;
> >  lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ;
> >  lwsync       | lwsync       | lwsync       | sync         ;
> >  stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r4) | lwz r9,0(r2) ;
> >  lwsync       | lwz r7,0(r2) |              |              ;
> >  stw r1,0(r5) | lwsync       |              |              ;
> >               | stw r1,0(r6) |              |              ;
> > exists
> > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *)
> > (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *)
> > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *)
> > (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0)
> 
> i.e. we should rewrite this using READ_ONCE/WRITE_ONCE and smp_mb() etc.

Yep!

> > ------------------------------------------------------------------------
> > 
> > commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41
> > Author: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> > Date:   Fri Jan 15 09:30:42 2016 -0800
> > 
> >     documentation: Distinguish between local and global transitivity
> >     
> >     The introduction of smp_load_acquire() and smp_store_release() had
> >     the side effect of introducing a weaker notion of transitivity:
> >     The transitivity of full smp_mb() barriers is global, but that
> >     of smp_store_release()/smp_load_acquire() chains is local.  This
> >     commit therefore introduces the notion of local transitivity and
> >     gives an example.
> >     
> >     Reported-by: Peter Zijlstra <peterz@...radead.org>
> >     Reported-by: Will Deacon <will.deacon@....com>
> >     Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> > 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index c66ba46d8079..d8109ed99342 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes.
> >  General barriers are therefore required to ensure that all CPUs agree
> >  on the combined order of CPU 1's and CPU 2's accesses.
> >  
> > -To reiterate, if your code requires transitivity, use general barriers
> > -throughout.
> > +General barriers provide "global transitivity", so that all CPUs will
> > +agree on the order of operations.  In contrast, a chain of release-acquire
> > +pairs provides only "local transitivity", so that only those CPUs on
> > +the chain are guaranteed to agree on the combined order of the accesses.
> 
> Thanks for having a go at this. I tried defining something axiomatically,
> but got stuck pretty quickly. In my scheme, I used "data-directed
> transitivity" instead of "local transitivity", since the latter seems to
> be a bit of a misnomer.

I figured that "local" meant local to the CPUs participating in the
release-acquire chain.  As opposed to smp_mb() chains where the ordering
is "global" as in visible to all CPUs, whether on the chain or not.
Does that help?

> > +For example, switching to C code in deference to Herman Hollerith:
> > +
> > +	int u, v, x, y, z;
> > +
> > +	void cpu0(void)
> > +	{
> > +		r0 = smp_load_acquire(&x);
> > +		WRITE_ONCE(u, 1);
> > +		smp_store_release(&y, 1);
> > +	}
> > +
> > +	void cpu1(void)
> > +	{
> > +		r1 = smp_load_acquire(&y);
> > +		r4 = READ_ONCE(v);
> > +		r5 = READ_ONCE(u);
> > +		smp_store_release(&z, 1);
> > +	}
> > +
> > +	void cpu2(void)
> > +	{
> > +		r2 = smp_load_acquire(&z);
> > +		smp_store_release(&x, 1);
> > +	}
> > +
> > +	void cpu3(void)
> > +	{
> > +		WRITE_ONCE(v, 1);
> > +		smp_mb();
> > +		r3 = READ_ONCE(u);
> > +	}
> > +
> > +Because cpu0(), cpu1(), and cpu2() participate in a local transitive
> > +chain of smp_store_release()/smp_load_acquire() pairs, the following
> > +outcome is prohibited:
> > +
> > +	r0 == 1 && r1 == 1 && r2 == 1
> > +
> > +Furthermore, because of the release-acquire relationship between cpu0()
> > +and cpu1(), cpu1() must see cpu0()'s writes, so that the following
> > +outcome is prohibited:
> > +
> > +	r1 == 1 && r5 == 0
> > +
> > +However, the transitivity of release-acquire is local to the participating
> > +CPUs and does not apply to cpu3().  Therefore, the following outcome
> > +is possible:
> > +
> > +	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
> 
> I think you should be completely explicit and include r5 == 1 here, too.

Good point -- I added this as an additional outcome:

	r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1
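
One way to poke at these outcomes from userspace is a small stress
harness.  What follows is only a sketch under stated assumptions: it maps
the kernel primitives onto C11 atomics (acquire/release loads and stores,
relaxed accesses for READ_ONCE()/WRITE_ONCE(), a seq_cst fence for
smp_mb()), which is close to but not the same as the kernel's own memory
model.  The macro definitions and count_prohibited() are hypothetical
names, not part of the patch.

```c
/*
 * Userspace sketch of the four-CPU example, built on C11 atomics and
 * POSIX threads.  The macros are stand-ins, not the kernel definitions.
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define READ_ONCE(var)            atomic_load_explicit(&(var), memory_order_relaxed)
#define WRITE_ONCE(var, val)      atomic_store_explicit(&(var), (val), memory_order_relaxed)
#define smp_load_acquire(p)       atomic_load_explicit((p), memory_order_acquire)
#define smp_store_release(p, val) atomic_store_explicit((p), (val), memory_order_release)
#define smp_mb()                  atomic_thread_fence(memory_order_seq_cst)

static atomic_int u, v, x, y, z;
static int r0, r1, r2, r3, r4, r5;	/* each written by exactly one thread */

static void *cpu0(void *arg)
{
	(void)arg;
	r0 = smp_load_acquire(&x);
	WRITE_ONCE(u, 1);
	smp_store_release(&y, 1);
	return NULL;
}

static void *cpu1(void *arg)
{
	(void)arg;
	r1 = smp_load_acquire(&y);
	r4 = READ_ONCE(v);
	r5 = READ_ONCE(u);
	smp_store_release(&z, 1);
	return NULL;
}

static void *cpu2(void *arg)
{
	(void)arg;
	r2 = smp_load_acquire(&z);
	smp_store_release(&x, 1);
	return NULL;
}

static void *cpu3(void *arg)
{
	(void)arg;
	WRITE_ONCE(v, 1);
	smp_mb();
	r3 = READ_ONCE(u);
	return NULL;
}

/*
 * Run the example `trials` times and return how many runs produced one
 * of the two outcomes that the patch text calls prohibited.
 */
static int count_prohibited(int trials)
{
	void *(*cpus[4])(void *) = { cpu0, cpu1, cpu2, cpu3 };
	int bad = 0;

	for (int t = 0; t < trials; t++) {
		pthread_t tid[4];

		/* No worker threads exist here, so plain resets are safe. */
		atomic_store(&u, 0);
		atomic_store(&v, 0);
		atomic_store(&x, 0);
		atomic_store(&y, 0);
		atomic_store(&z, 0);

		for (int i = 0; i < 4; i++) {
			int err = pthread_create(&tid[i], NULL, cpus[i], NULL);

			assert(err == 0);
			(void)err;
		}
		for (int i = 0; i < 4; i++)
			pthread_join(tid[i], NULL);

		if (r0 == 1 && r1 == 1 && r2 == 1)
			bad++;		/* cycle through the whole chain */
		if (r1 == 1 && r5 == 0)
			bad++;		/* cpu1() missed cpu0()'s store to u */
	}
	return bad;
}
```

Calling count_prohibited() with a large trial count from a main() should
return zero.  A zero result only means the prohibited outcomes were not
observed on that machine, not that they cannot occur, so this complements
rather than replaces a litmus-test tool.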

> Also -- where would you add the smp_mb__after_release_acquire to fix
> (i.e. forbid) this? Immediately after cpu1()'s read of y?

That sounds plausible, but we would first have to agree on exactly
what smp_mb__after_release_acquire() did.  ;-)

> > +Although cpu0(), cpu1(), and cpu2() will see their respective reads and
> > +writes in order, CPUs not involved in the release-acquire chain might
> > +well disagree on the order.  This disagreement stems from the fact that
> > +the weak memory-barrier instructions used to implement smp_load_acquire()
> > +and smp_store_release() are not required to order prior stores against
> > +subsequent loads in all cases.  This means that cpu3() can see cpu0()'s
> > +store to u as happening -after- cpu1()'s load from v, even though
> > +both cpu0() and cpu1() agree that these two operations occurred in the
> > +intended order.
> > +
> > +However, please keep in mind that smp_load_acquire() is not magic.
> > +In particular, it simply reads from its argument with ordering.  It does
> > +-not- ensure that any particular value will be read.  Therefore, the
> > +following outcome is possible:
> > +
> > +	r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
> > +
> > +Note that this outcome can happen even on a mythical sequentially
> > +consistent system where nothing is ever reordered.
> 
> I'm not sure this last bit is strictly necessary. If somebody thinks that
> acquire/release involve some sort of implicit synchronisation, I think
> they may have bigger problems with memory-barriers.txt.

Agreed.  But unless I add text like this occasionally, such people could
easily read through much of memory-barriers.txt and think that they did
in fact understand it.  So I have to occasionally trip an assertion in
their brain.  Or try to...  :-/

							Thanx, Paul
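
The remark that the all-zero outcome can happen even on a sequentially
consistent system can be checked by brute force.  The following is a
minimal sketch, assuming a toy model in which every access takes effect
atomically in one global order, so smp_mb() and the acquire/release
annotations add nothing and are omitted.  The names struct op, prog,
sc_allows() and the predicates are hypothetical, introduced only for this
illustration.

```c
/*
 * Enumerate every sequentially consistent interleaving of the four-CPU
 * example and ask whether a given final register state is reachable.
 */
#include <assert.h>
#include <stdbool.h>

enum { U, V, X, Y, Z, NVARS };
enum { NCPUS = 4, NREGS = 6, MAXOPS = 4 };

struct op {
	bool is_store;	/* true: var = 1;  false: reg = var */
	int var;
	int reg;	/* unused for stores */
};

/* One row per CPU, in program order, mirroring cpu0() through cpu3(). */
static const struct op prog[NCPUS][MAXOPS] = {
	{ { false, X, 0 }, { true, U, -1 }, { true, Y, -1 } },
	{ { false, Y, 1 }, { false, V, 4 }, { false, U, 5 }, { true, Z, -1 } },
	{ { false, Z, 2 }, { true, X, -1 } },
	{ { true, V, -1 }, { false, U, 3 } },
};
static const int len[NCPUS] = { 3, 4, 2, 2 };

struct state {
	int mem[NVARS];
	int reg[NREGS];
	int pc[NCPUS];
};

typedef bool (*pred_fn)(const int *r);

/* Depth-first search: true if some complete execution satisfies `want`. */
static bool reachable(struct state s, pred_fn want)
{
	bool done = true;

	for (int c = 0; c < NCPUS; c++) {
		if (s.pc[c] == len[c])
			continue;
		done = false;

		struct state n = s;
		const struct op *o = &prog[c][n.pc[c]++];

		if (o->is_store)
			n.mem[o->var] = 1;
		else
			n.reg[o->reg] = n.mem[o->var];
		if (reachable(n, want))
			return true;
	}
	return done && want(s.reg);
}

static bool sc_allows(pred_fn want)
{
	struct state init = { .pc = { 0 } };	/* everything starts at zero */

	return reachable(init, want);
}

static bool chain_cycle(const int *r)
{
	return r[0] == 1 && r[1] == 1 && r[2] == 1;
}

static bool missed_store(const int *r)
{
	return r[1] == 1 && r[5] == 0;
}

static bool weak_outcome(const int *r)
{
	return r[0] == 0 && r[1] == 1 && r[2] == 1 &&
	       r[3] == 0 && r[4] == 0 && r[5] == 1;
}

static bool all_zero(const int *r)
{
	return r[0] == 0 && r[1] == 0 && r[2] == 0 && r[5] == 0;
}

static void check_sc_outcomes(void)
{
	assert(!sc_allows(chain_cycle));	/* prohibited everywhere */
	assert(!sc_allows(missed_store));	/* prohibited everywhere */
	assert(!sc_allows(weak_outcome));	/* needs weaker-than-SC ordering */
	assert(sc_allows(all_zero));		/* acquire does not wait */
}
```

In this model the outcome involving cpu3() is unreachable, which is the
point of the local-transitivity discussion: it can appear only on systems
weaker than sequential consistency.  The all-zero outcome is reachable
simply because each load may run before the store it would otherwise pair
with.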