Message-ID: <20090212023308.GA21157@linux.vnet.ibm.com>
Date: Wed, 11 Feb 2009 18:33:08 -0800
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Mathieu Desnoyers <compudj@...stal.dyndns.org>
Cc: ltt-dev@...ts.casi.polymtl.ca, linux-kernel@...r.kernel.org
Subject: Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux
(repost)
On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@...ux.vnet.ibm.com) wrote:
>
> [ . . . ]
>
> > > > Hrm, let me present it in a different, more straightforward way :
> > > >
> > > > In your Promela model (here : http://lkml.org/lkml/2009/2/10/419)
> > > >
> > > > There is a memory barrier here in the updater :
> > > >
> > > > do
> > > > :: 1 ->
> > > > if
> > > > :: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > skip;
> > > > :: else -> break;
> > > > fi
> > > > od;
> > > > need_mb = 1;
> > > > do
> > > > :: need_mb == 1 -> skip;
> > > > :: need_mb == 0 -> break;
> > > > od;
> > > > urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > >
> > > I believe you were actually looking for a memory barrier here, not?
> > > I do not believe that your urcu.c has a memory barrier here, please
> > > see below.
> > >
> > > > do
> > > > :: 1 ->
> > > > if
> > > > :: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > skip;
> > > > :: else -> break;
> > > > fi;
> > > > od;
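
(For readers following along: the "need_mb" handshake in the model above
abstracts the updater forcing a memory barrier on a reader thread.  A
minimal C sketch of that idea follows; the names, the signal choice, and
the single need_mb flag are illustrative assumptions, not the verbatim
urcu.c mechanism.)

	#include <pthread.h>
	#include <signal.h>

	#define smp_mb()  __sync_synchronize()	/* stand-in full barrier */
	#define barrier() __asm__ __volatile__("" ::: "memory")

	static volatile sig_atomic_t need_mb;	/* per-reader in a real version */

	/* Hypothetical handler, assumed installed for SIGUSR1 via sigaction(). */
	static void reader_mb_handler(int sig)
	{
		(void)sig;
		smp_mb();		/* reader executes the requested barrier */
		need_mb = 0;		/* ... and acknowledges it */
	}

	/* Hypothetical updater side: models "need_mb = 1; wait until 0". */
	static void force_reader_mb(pthread_t reader)
	{
		need_mb = 1;
		pthread_kill(reader, SIGUSR1);	/* poke the reader */
		while (need_mb)			/* spin until acknowledged */
			barrier();
	}
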
> > > >
> > > > However, in your C code (rcu_nest32.c), there is none. So it is at the
> > > > very least an inconsistency between your code and your model.
> > >
> > > The urcu.c 3a9e6e9df706b8d39af94d2f027210e2e7d4106e lays out as follows:
> > >
> > > synchronize_rcu()
> > >
> > > switch_qparity()
> > >
> > > force_mb_all_threads()
> > >
> > > switch_next_urcu_qparity() [Just does counter flip]
> > >
> >
> > Hrm... there would potentially be a missing mb() here.
>
> K, I added it to the model.
>
> > > wait_for_quiescent_state()
> > >
> > > Wait for all threads
> > >
> > > force_mb_all_threads()
> > > My model does not represent this
> > > memory barrier, because it seemed to
> > > me that it was redundant with the
> > > following one.
> > >
> >
> > Yes, this one is redundant.
>
> I left it in for now...
>
> > > I added it, no effect.
> > >
> > > switch_qparity()
> > >
> > > force_mb_all_threads()
> > >
> > > switch_next_urcu_qparity() [Just does counter flip]
> > >
> >
> > Same as above, potentially missing mb().
>
> I added it to the model.
>
> > > wait_for_quiescent_state()
> > >
> > > Wait for all threads
> > >
> > > force_mb_all_threads()
> > >
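
(Restating the urcu.c layout above as a flattened C sketch; this is only
the call structure under discussion, not the verbatim code, and
wait_for_all_threads() is a placeholder name for the "Wait for all
threads" step.  The two potentially missing barriers are marked.)

	void force_mb_all_threads(void);	/* forces mb on all readers */
	void switch_next_urcu_qparity(void);	/* just does the counter flip */
	void wait_for_all_threads(void);	/* placeholder for the scan */

	void synchronize_rcu(void)		/* rough structure only */
	{
		/* switch_qparity() */
		force_mb_all_threads();
		switch_next_urcu_qparity();
		/* <-- potentially missing mb(), per the discussion above */

		/* wait_for_quiescent_state() */
		wait_for_all_threads();
		force_mb_all_threads();		/* arguably redundant */

		/* switch_qparity() */
		force_mb_all_threads();
		switch_next_urcu_qparity();
		/* <-- potentially missing mb() here as well */

		/* wait_for_quiescent_state() */
		wait_for_all_threads();
		force_mb_all_threads();
	}
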
> > > The rcu_nest32.c 6da793208a8f60ea41df60164ded85b4c5c5307d lays out as
> > > follows:
> > >
> > > synchronize_rcu()
> > >
> > > flip_counter_and_wait()
> > >
> > > flips counter
> > >
> > > smp_mb();
> > >
> > > Wait for threads
> > >
> >
> > This is the point where I wonder if we should add an mb() to your code.
>
> Might well be, though I would argue for the very end, where I left out
> the smp_mb(). I clearly need to make another Promela model for this
> code, but we should probably focus on yours first, given that I don't
> have any use cases for mine.
>
> > > flip_counter_and_wait()
> > >
> > > flips counter
> > >
> > > smp_mb();
> > >
> > > Wait for threads
>
> And I really do have an unlock followed by an smp_mb() at this point.
>
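
(The rcu_nest32.c side in the same flattened sketch form, with the points
just discussed marked; flip_counter() and wait_for_threads() are
placeholder names for the steps in the layout above.)

	void flip_counter(void);		/* placeholder: flips counter */
	void wait_for_threads(void);		/* placeholder: reader scan */
	#define smp_mb() __sync_synchronize()	/* stand-in full barrier */

	void synchronize_rcu(void)		/* rough structure only */
	{
		/* flip_counter_and_wait() */
		flip_counter();
		smp_mb();
		wait_for_threads();
		/* <-- the point where Mathieu wonders about adding an mb() */

		/* flip_counter_and_wait() */
		flip_counter();
		smp_mb();
		wait_for_threads();
		/* unlock followed by an smp_mb() at this point (per Paul) */
	}
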
> > > So, if I am reading the code correctly, I have memory barriers
> > > everywhere you don't and vice versa. ;-)
> > >
> >
> > Exactly. You have mb() between
> > flips counter and (next) Wait for threads
> >
> > I have mb() between
> > (previous) Wait for threads and flips counter
> >
> > Both might be required. Or none. :)
>
> Well, adding in the two to yours still gets Promela failures, please
> see attached. Nothing quite like a multi-thousand step failure case,
> I have to admit! ;-)
>
> > > The reason that I believe that I do not need a memory barrier between
> > > the wait-for-threads and the subsequent flip is that the threads we
> > > are waiting for have to have already committed to the earlier value of
> > > the counter, and so changing the counter out of order has no effect.
> > >
> > > Does this make sense, or am I confused?
> >
> > So if we remove the mb(), as in your code, between the "flips counter"
> > and the (next) "Wait for threads" steps, we are doing these operations
> > in random order on the writer side:
>
> I don't believe that I get to remove any mb()s from my code...
>
> > Sequence 1 - what we expect
> > A.1 - flip counter
> > A.2 - read counter
> > B - read other threads urcu_active_readers
> >
> > So what happens if the CPU decides to reorder the unrelated
> > operations? We get :
> >
> > Sequence 2
> > B - read other threads urcu_active_readers
> > A.1 - flip counter
> > A.2 - read counter
> >
> > Sequence 3
> > A.1 - flip counter
> > A.2 - read counter
> > B - read other threads urcu_active_readers
> >
> > Sequence 4
> > A.1 - flip counter
> > B - read other threads urcu_active_readers
> > A.2 - read counter
> >
> >
> > Sequences 1, 3 and 4 are OK because the counter flip happens before we
> > read the other threads' urcu_active_readers counts.
> >
> > However, we have to consider Sequence 2 carefully, because we will read
> > the other threads' urcu_active_readers counts before those readers see
> > that we flipped the counter.
> >
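
(To make the A/B labels concrete, here is the writer side written out as
straight-line code, in the single-reader form of the Promela snippet
quoted near the top of this message; sketch only, using the model's
variable names, with flip_and_wait() as a hypothetical wrapper.)

	extern unsigned long urcu_gp_ctr, urcu_active_readers;
	/* RCU_GP_CTR_BIT and RCU_GP_CTR_NEST_MASK as in the model above. */

	void flip_and_wait(void)
	{
		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;	/* A.1: flip */
		/* An mb() here is what would rule out Sequence 2. */
		while ((urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		       (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		       (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK))	/* A.2 + B */
			;	/* spin: reader still in old q.s. period */
	}
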
> > The reader side does either :
> >
> > seq. 1
> > R.1 - read urcu_active_readers
> > S.2 - read counter
> > RS.2- write urcu_active_readers, depends on read counter and read
> > urcu_active_readers
> >
> > (with R.1 and S.2 in random order)
> >
> > or
> >
> > seq. 2
> > R.1 - read urcu_active_readers
> > R.2 - write urcu_active_readers, depends on read urcu_active_readers
> >
> >
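
(And the reader side being described, as a sketch that is close to, but
not necessarily identical to, the actual rcu_read_lock();
urcu_active_readers is per-thread in the real implementation, and
RCU_GP_CTR_NEST_MASK is as in the model quoted earlier.)

	extern unsigned long urcu_active_readers;	/* per-thread in reality */
	extern unsigned long urcu_gp_ctr;

	static inline void rcu_read_lock(void)
	{
		unsigned long tmp = urcu_active_readers;	/* R.1 */

		if (!(tmp & RCU_GP_CTR_NEST_MASK))
			/* Outermost nesting: snapshot the global counter,
			 * parity bit included (seq. 1: S.2 + RS.2). */
			urcu_active_readers = urcu_gp_ctr;
		else
			/* Nested: just bump the nest count (seq. 2: R.2). */
			urcu_active_readers = tmp + 1;
	}
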
> > So we could have the following reader+writer sequence :
> >
> > Interleaved writer Sequence 2 and reader seq. 1.
> >
> > Reader:
> > R.1 - read urcu_active_readers
> > S.2 - read counter
> > Writer:
> > B - read other threads urcu_active_readers (there are none)
> > A.1 - flip counter
> > A.2 - read counter
> > Reader:
> > RS.2- write urcu_active_readers, depends on read counter and read
> > urcu_active_readers
> >
> > Here, the reader would have updated its counter as belonging to the old
> > q.s. period, but the writer will later wait for the new period. But
> > given the writer will eventually do a second flip+wait, the reader in
> > the other q.s. window will be caught by the second flip.
> >
> > Therefore, we could be tempted to think that those mb()s could be
> > unnecessary, which would lead to a scheme where the urcu_active_readers
> > and urcu_gp_ctr accesses are done in a completely random order relative
> > to each other. Let's see what that gives :
> >
> > synchronize_rcu()
> >
> > force_mb_all_threads() /*
> > * Orders pointer publication and
> > * (urcu_active_readers/urcu_gp_ctr accesses)
> > */
> > switch_qparity()
> >
> > switch_next_urcu_qparity() [just does counter flip 0->1]
> >
> > wait_for_quiescent_state()
> >
> > wait for all threads in parity 0
> >
> > switch_qparity()
> >
> > switch_next_urcu_qparity() [Just does counter flip 1->0]
> >
> > wait_for_quiescent_state()
> >
> > Wait for all threads in parity 1
> >
> > force_mb_all_threads() /*
> > * Orders
> > * (urcu_active_readers/urcu_gp_ctr accesses)
> > * and old data removal.
> > */
> >
> >
> >
> > *but* ! There is a reason why we don't want to do this. If
> >
> > switch_next_urcu_qparity() [Just does counter flip 1->0]
> >
> > happens before the end of the previous
> >
> > Wait for all threads in parity 0
> >
> > We enter a situation where all newly arriving readers will see the
> > parity bit as 0, although we are still waiting for that parity to end.
> > We end up in a state where the writer can be blocked forever (no
> > possible progress) if there is a steady stream of readers subscribed to
> > the data.
> >
> > Basically, to put it differently, we could simply remove the bit
> > flipping from the writer and wait for *all* readers to exit their
> > critical section (even the ones simply interested in the new pointer).
> > But this has the same problem as the version above: the writer won't
> > make progress if there are always readers in a critical section.
> >
> > The same applies to
> >
> > switch_next_urcu_qparity() [Just does counter flip 0->1]
> >
> > wait for all threads in parity 0
> >
> > If we don't put an mb() between those two (which is the mistake I made),
> > we can end up waiting for readers in parity 0 while the parity bit hasn't
> > been flipped yet. Oops. Same potential no-progress situation.
> >
> > The ordering of the reader's memory reads of urcu_active_readers and
> > urcu_gp_ctr does not seem to matter, because the data contains
> > information about which q.s. period parity it is in. Whichever order
> > those variables are read in seems to work fine.
> >
> > In the end, it is to ensure that the writer always progresses that we
> > have to enforce an smp_mb() between *all* switch_next_urcu_qparity and
> > wait-for-threads pairs. Mine and yours.
> >
> > Or maybe there is a detail I haven't correctly understood that already
> > ensures this without the mb() in your code ?
> >
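
(In code form, the barrier placement argued for above is the following;
in urcu.c the barrier would presumably come from force_mb_all_threads()
rather than a bare smp_mb().)

	switch_next_urcu_qparity();	/* counter flip 0 -> 1 */
	smp_mb();			/* flip visible before scanning readers */
	wait_for_quiescent_state();	/* wait for all threads in parity 0 */

	switch_next_urcu_qparity();	/* counter flip 1 -> 0 */
	smp_mb();			/* likewise for the second flip */
	wait_for_quiescent_state();	/* wait for all threads in parity 1 */
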
> > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > failure case I pointed out earlier. :-/ And here I thought that the
> > > point of such models was to detect additional failure cases!!!)
> > >
> >
> > Yes, I'll have to dig deeper into it.
>
> Well, as I said, I attached the current model and the error trail.

And I had bugs in my model that allowed the rcu_read_lock() model
to nest indefinitely, which overflowed into the top bit, messing
things up. :-/
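
To illustrate the failure mode (with a made-up bit layout; the model and
urcu.c may place the bits differently):

	/* Illustrative layout only: parity bit directly above the nest field. */
	#define RCU_GP_CTR_BIT		(1UL << 16)
	#define RCU_GP_CTR_NEST_MASK	(RCU_GP_CTR_BIT - 1)

	/* With this layout, 65536 nested rcu_read_lock()s carry out of the
	 * nest field and corrupt RCU_GP_CTR_BIT, so the grace-period logic
	 * sees the wrong parity.  Unbounded nesting in the model allowed
	 * exactly that. */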

Attached is a fixed model.  This model validates correctly (woo-hoo!).
Even better, it gives the expected error if you comment out line 180 and
uncomment line 213, the latter corresponding to the error case I called
out a few days ago.

I will play with removing the models of mb...
Thanx, Paul
View attachment "urcu.spin" of type "text/plain" (6865 bytes)