linux-kernel - Re: Uses for memory barriers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.44L0.0609131207340.5908-100000@iolanthe.rowland.org>
Date:	Thu, 14 Sep 2006 10:58:43 -0400 (EDT)
From:	Alan Stern <stern@...land.harvard.edu>
To:	"Paul E. McKenney" <paulmck@...ibm.com>
cc:	David Howells <dhowells@...hat.com>,
	Kernel development list <linux-kernel@...r.kernel.org>
Subject: Re: Uses for memory barriers

On Tue, 12 Sep 2006, Paul E. McKenney wrote:

> > Incidentally, I object very strongly to the way you hardware folk keep
> > talking about events being only partially ordered.  This isn't the case at
> > all.  Given a precise physical definition of what an event is (a
> > particular transistor changing state for example, or perhaps the entire
> > time interval starting when one transistor changes state and ending when
> > another does), each event takes place at a moment of time or during an
> > interval of time.
> 
> Sorry, but there is absolutely no way I am carrying this discussion down
> to the transistor level.  You need to talk to a real hardware guy for
> that.  ;-)

I don't mind keeping the discussion at a higher level.  The real point is
that physical events occur at real, physical times.

> > Now time really _is_ totally ordered (so long as you confine yourself to a
> > single reference frame!) and hence in any individual run of a computer
> > program so are the events -- provided you take proper care when discussing
> > events that encompass a time interval.
> 
> But different CPUs would have different opinions about the time at which
> a given event (e.g., a store) occurred.

That's okay.  There isn't any single notion of "This store occurred at 
this time".  Instead, for each CPU there is a notion of "This store became 
visible to this CPU at this time".

>  They would also have different
> opinions about the order in which sequences of stores occurred.

Again, not a problem.  And for the same reason.  Although if the stores 
are all to the same location, different CPUs should _not_ have different 
opinions about the order in which they became visible.

>  So the
> linear underlying time is not all that relevant -- maybe it is there, maybe
> not, but software cannot see it, so it cannot rely on it.  Hence the partial
> order.

Maybe the software can't see it directly, but it can still help when
you're trying to reason about the program's behavior.

> The underlying problem is a given load or store is actually made up of
> many events, which will occur at different times.  You could arbitrarily
> choose a particular event as being the key event that determines the
> time for the corresponding load or store, but this won't really help.
> Which of the following events should determine the timestamp for a
> given store from CPU 0?

That's not a valid notion.  Let's ask instead what is the timestamp for 
when a given store from CPU 0 becomes visible to CPU 1.

> 1.	When the instruction determines what value to store?
> 
> 2.	When an entry on CPU 0's store queue becomes available?
> 
> 3.	When the store-queue entry is filled in?  (This is the earliest
> 	time that makes sense to me.)
> 
> 4.	When the invalidation request is transmitted (and more than
> 	one such transmission is required on some architectures)?
> 
> 5.	When the invalidation request is received?  If so, by which
> 	CPU?
> 
> 6.	When CPUs send acknowledgements for the invalidation requests?

This is a very good candidate: when CPU 1 sends its acknowledgement.  Any
read on CPU 1 which commits to a return value before this time will not
see the new value.  A read on CPU 1 which commits to a return value after
this time may see the new value.

> 7.	When CPU 0 receives the acknowledgments?  (The reception of
> 	the last such acknowledgement would be a reasonable choice
> 	in some cases.)
> 
> 8.	When CPU 0 applies the store-queue entry to the cache line?
> 
> 9.	When the other CPUs process the invalidate request?
> 	(The processing of the invalidate request at the last
> 	CPU to do so would be another reasonable choice in some cases.)
> 
> I have an easier time thinking of this as a partial order than to try
> to keep track of many individual events that I cannot accurately measure.

All right, let's express things in terms of the partial order.  In those 
terms, this is the principle you would expound:

(P1):	If two writes are ordered on the same CPU and two reads are
	ordered on another CPU, and the first read sees the result of
	the second write, then the second read will see the result of
	the first write.

Correct?

> > The difficulty is that program runs have a stochastic character; it's
> > difficult or impossible to predict exactly what the hardware will do.  So
> > the same program events could be ordered one way during a run and the
> > opposite way during another run.  Each run is still totally ordered.
> 
> But if I must prove some property for all possible actual orderings, then
> keeping track of a partial ordering does make sense.  Enumerating all of
> the possible actual orderings is often infeasible.

You don't have to enumerate all possible cases to prove things.  
Mathematics would still be stuck back at the time of the ancient Greeks if
that were so.

> > If you want, you can derive a partial ordering by considering only those
> > pairs of events that will _always_ occur in the same order in _every_ run.
> > But thinking in terms of this sort of partial ordering is clearly
> > inadequate.  It doesn't even let you derive the properties of the
> > "canonical" pattern:
> > 
> > 	CPU 0			CPU 1
> > 	-----			-----
> > 	a = 1;			y = b;
> > 	wmb();			rmb();
> > 	b = 1;			x = a;
> > 				assert(!(y==1 && x==0));
> > 
> > In terms of partial orderings, we know that all the events on CPU 0 are
> > ordered by the wmb() and the events on CPU 1 are ordered by the rmb().  
> > But there is no ordering relation between the events on the two CPUs.  "y
> > = b" could occur either before or after "b = 1", hence these two events
> > are incomparable in the partial ordering.  So you can't conclude anything
> > about the assertion, if you confine your thinking to the partial ordering.
> 
> But I don't need absolute timings, all I need to know is whether or not
> CPU 1's "y=b" sees CPU 0's "b=1".

How does the partial-ordering point of view help you understand the
read-mb-write pattern?

	CPU 0			CPU 1
	-----			-----
	x = a;			y = b + 1;
	mb();			mb();
	b = 1;			a = 1;
	assert(!(x==1 && y!=1));

You would need an additional principle similar to P1 above.  Trying to
apply P1 won't work; all you can deduce is that CPU 1 writes y and a
whereas CPU 0 reads a and y, hence CPU 1 can assert only (!(x==1 &&
y==0)), not (!(x==1 && y!=1)).

This idea, that some _new_ principle P2 is needed to explain
read-mb-write, is one of the major points I have been trying to establish
here.  Discussions like David's always mention P1 but they hardly ever
mention this P2.  In fact I can't recall ever seeing it mentioned before
this email thread.  And yet, as we've seen, P2 is essential for explaining
why synchronization primitives work.


> > One could derive the result a little more formally using the causality 
> > principle I mentioned above (LOAD before STORE cannot return the result of 
> > the STORE) together with another causality principle: A CPU cannot make 
> > the result of a STORE available to other CPUs (or to itself) before it 
> > knows the value to be stored.  A cache might send out invalidate messages 
> > before knowing the value to be stored, but it can't send out read 
> > responses.
> 
> Yes but...  Some other CPU might see the resulting store before it
> saw the information that the storing CPU based the stored value on.

Nothing wrong with that.  As I have mentioned before, stores don't have to 
become visible to all CPUs at the same time.

> > What do you mean by "violate causality"?  Storing a value in y (and making
> > that value available to the other CPU) before the read has obtained b's
> > value?  I don't see how that could happen on any architecture, unless your 
> > machine uses thiotimoline.  :-)
> 
> Different CPUs can perceive a given event happening at different times,
> so other CPUs might see the store into y before they saw the logically
> earlier store into b.

Nothing wrong with that either, and for the same reason.

>  Thiotimoline available on eBay?  I would be indeed
> be interested!  ;-)

You and me both!

> Your point might be that this example doesn't care, given that the CPU
> doing the store into y is also doing the assert, but that doesn't change
> my basic point that you have to be -very- careful when using causality
> arguments when working out memory ordering.

I think that all you really need to do is remember that statements can be 
reordered, that stores become visible to different CPUs at different 
times, and that loads don't always return the value of the last visible 
store.  Everything else is intuitive -- except for the way memory barriers 
work!

> > I wouldn't put it that way.  Knowing that CPUs are free to reorder
> > operations in the absence of barriers to prevent such things, the
> > violation you mention wouldn't seem unintuitive to me.  I tend to have 
> > more trouble remembering that reads don't necessarily have to return the 
> > most recent values available.
> 
> So do I, which is one reason that I resist the notion that stores happen
> at a single specific time.  If I consider stores to be fuzzy events, then
> it is easier to for me to keep in mind that concurrent reads to the
> same value from different CPUs might return different values.

Instead of considering stores to be fuzzy events, you can think of them as
becoming visible at precise times to individual CPUs.  That sort of
reasoning would help someone to understand P2, whereas the
partially-ordered fuzzy-events approach would not.


> > > > This read-mb-write pattern turns out to be vital for implementing
> > > > sychronization primitives.  Take your own example:
> > > > 
> > > > > Consider the following (lame) definitions for spinlock primitives,
> > > > > but in an alternate universe where atomic_xchg() did not imply a
> > > > > memory barrier, and on a weak-memory CPU:
> > > > > 
> > > > > 	typedef spinlock_t atomic_t;
> > > > > 
> > > > > 	void spin_lock(spinlock_t *l)
> > > > > 	{
> > > > > 		for (;;) {
> > > > > 			if (atomic_xchg(l, 1) == 0) {
> > > > > 				smp_mb();
> > > > > 				return;
> > > > > 			}
> > > > > 			while (atomic_read(l) != 0) barrier();
> > > > > 		}
> > > > > 
> > > > > 	}
> > > > > 
> > > > > 	void spin_unlock(spinlock_t *l)
> > > > > 	{
> > > > > 		smp_mb();
> > > > > 		atomic_set(l, 0);
> > > > > 	}
> > > > > 
> > > > > The spin_lock() primitive needs smp_mb() to ensure that all loads and
> > > > > stores in the following critical section happen only -after- the lock
> > > > > is acquired.  Similarly for the spin_unlock() primitive.
> > > > 
> > > > In fact that last paragraph isn't quite right.  The spin_lock()
> > > > primitive would also work with smp_rmb() in place of smb_mb().  (I can
> > > > explain my reasoning later, if you're interested.)
> > > 
> > > With smp_rmb(), why couldn't the following, given globals b and c:
> > > 
> > > 	spin_lock(&mylock);
> > > 	b = 1;
> > > 	c = 1;
> > > 	spin_unlock(&mylock);
> > > 
> > > be executed by the CPU as follows?
> > > 
> > > 	c = 1;
> > > 	b = 1;
> > > 	spin_lock(&mylock);
> > > 	spin_unlock(&mylock);
> > 
> > Because of the conditional in spin_lock().
> > 
> > > This order of execution seems to me to be highly undesireable.  ;-)
> > > 
> > > So, what am I missing here???
> > 
> > A CPU cannot move a write back past a conditional.  Otherwise it runs
> > the risk of committing the write when (according to the program flow) the
> > write should never have taken place.  No CPU does speculative writes.
> 
> From the perspective of some other CPU holding the lock, agreed (I
> think...).  From the perspective of some other CPU reading the lock
> word and the variables b and c, I do not agree.  The CPU doing the
> stores does not have to move the writes back past the conditional --
> all it has to do is move the "c=1" and "b=1" stores back past the store
> implied by the atomic operation in the spin_lock().  All three stores
> then appear to have happened after the conditional.

Ah.  I'm glad you mentioned this.

Yes, it's true that given sufficiently weak ordering, the writes could 
be moved back inbetween the load and store implicit in the atomic 
exchange.  It's worth pointing out that even if this does occur, it will 
not be visible to any CPU that accesses b and c only from within a 
critical section.  But it could be visible in a situation like this:

	CPU 0				CPU 1
	-----				-----
	[call spin_lock(&mylock)]	x = b;
	read mylock, obtain 0		mb();
	b = 1; [moved up]		y = atomic_read(&mylock);
	write mylock = 1
	rmb();
	[leave spin_lock()]
					mb();
					assert(!(x==1 && y==0 && c==0));
	c = 1;
	spin_unlock(&mylock);

The assertion could indeed fail.  HOWEVER...

What you may not realize is that even when spin_lock() contains your
original full mb(), if CPU 0 reads b (instead of writing b) that read
could appear to another CPU to have occurred before the write to mylock.  
Example:

	CPU 0				CPU 1
	-----				-----
	[call spin_lock(&mylock)]	b = 1;
	read mylock, obtain 0		mb();
	write mylock = 1		y = atomic_read(&mylock);
	mb();
	[leave spin_lock()]

	x = b + 1;			mb();
					assert(!(x==1 && y==0 && c==0));
	c = 1;
	spin_unlock(&mylock);

If the assertion fails, then CPU 1 will think that CPU 0 read the value of 
b before setting mylock to 1.  But the assertion can fail!  It's another 
example of the write-mb-read pattern.  If CPU 0's write to mylock crosses 
with CPU 1's write to b, then both CPU 0's read of b and CPU 1's read of 
mylock could obtain the old values.

So in this respect your implementation already fails to prevent reads from 
leaking partway out of the critical section.  My implementation does the 
same thing with writes, but otherwise is no worse.


> > What's the difference between imposing an ordering constraint and forcing
> > two events to occur in a particular order?  To me the phrases appear
> > synonymous.
> 
> That is because you persist in believing that things happen at a
> well-defined single time.  ;-)

Physical events _do_ happen at well-defined times.  Even if we don't know 
exactly when those times occur.


> > A more correct pattern would have to look like this:
> > 
> > 	CPU 0			CPU 1
> > 	-----			-----
> > 	while (c == 0) ;	wmb();
> > 				// Earlier writes to a or b are now flushed
> > 				c = 1;
> > 
> > 	// arbitrarily long delay with neither CPU writing to a or b
> > 
> > 	a = 1;			y = b;
> > 	wmb();			rmb();
> > 	b = 1;			x = a;
> > 				assert(!(y==1 && x==0));
> > 
> > In general you might say that a LOAD following rmb won't return the value 
> > from any STORE that became visible before the rmb other than the last one.  
> > But that leaves open the question of which STOREs actually do become 
> > visible.  I don't know how to deal with that in general.
> 
> The best answer I can give is that an rmb() -by- -itself- means absolutely
> nothing.  The only way it has any meaning is in conjunction with an
> mb() or a wmb() on some other CPU.  The "pairwise" stuff in David's
> documentation.

Even that isn't enough.  You also have to have some sort of additional 
information that can guarantee CPU 0's write to b will become visible to 
CPU 1.  If it never becomes visible then the assertion could fail.

But I can't think of any reasonable way to obtain such a guarantee that 
isn't implementation-specific.  Which leaves things in a rather 
unsatisfactory state...

> In other words, if there was an rmb() in flight on one CPU, but none
> of the other CPUs ever executed a memory barrier of any kind, then
> the lone rmb() would not need to place any constraints on any CPU's
> memory access.
> 
> Or, more accurately, if you wish to understand an rmb() in isolation,
> you must do so in the context of a specific hardware implementation.

I'll buy that second description but not the first.  It makes no sense to
say that what an rmb() instruction does to this CPU right here depends
somehow on what that CPU way over there is doing.  Instructions are
supposed to act locally.

> > Perhaps you could explain why (other than the fact that it might fail!).
> 
> Ummm...  Isn't that reason enough?
> 
> One reason is that the ++cnt can be reordered by both CPU and compiler
> to precede some of the code in "// Do something time-consuming with *x".
> I am assuming that only one thread is permitted to execute in the loop
> at the same time, otherwise there are many other failure mechanisms.

Yes.  There would have to be a barrier before ++cnt.

> > I actually do have code like this in one of my drivers.  But there are
> > significant differences from the example.  In the driver T0 doesn't run on
> > a CPU; it runs on a peripheral controller which accesses p via DMA and
> > makes cnt available over the I/O bus (inw).
> 
> OK, so there is only one thread T0...   And so it depends on the
> peripheral controller's memory model.  I would hope that they sorted
> out the synchronization.  ;-)

Me too!

Alan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/