[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110714042932.GA2932@linux.vnet.ibm.com>
Date:	Wed, 13 Jul 2011 21:29:32 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Dave Jones <davej@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...e.hu>
Subject: Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock)
On Wed, Jul 13, 2011 at 09:33:09AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 13, 2011 at 10:24:08AM +0200, Peter Zijlstra wrote:
> > On Tue, 2011-07-12 at 15:54 -0700, Paul E. McKenney wrote:
> > > > @@ -892,6 +892,7 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
> > > >        * prev into current:
> > > >        */
> > > >       spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
> > > > +     rcu_read_acquire();
> > > 
> > > Oooh...  This is a tricky one.  Hmmm...
> > 
> > <snip>
> > 
> > > Does any of this make sense?
> > 
> > No?
> > 
> > > > @@ -3141,6 +3170,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> > > >        */
> > > >  #ifndef __ARCH_WANT_UNLOCKED_CTXSW
> > > >       spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
> > > > +     rcu_read_release();
> > > 
> > > My guess is that we don't need the rcu_read_release() -- the arch shouldn't
> > > care that we have a non-atomic field in task_struct incremented, right?
> > > 
> > > Or am I confused about what this is doing?
> > 
> > Either that or I used the wrong primitives for what I was meaning to do.
> > 
> > So from looking at rcupdate.h rcu_read_{acquire,release} are the lockdep
> > annotations of the rcu_read_lock. The thing we're doing above is context
> > switching and not making lockdep upset.
> > 
> > The problem is that held lock state is tracked per task, and we're well
> > switching tasks, so we need to transfer the held lock state from the old
> > to the new task, since the new task will be unlocking the thing, if at
> > that time lockdep finds we're not actually holding it it screams bloody
> > murder.
> > 
> > So what we do is we release the lock (annotation only, the actual lock
> > says locked) on prev before switching and acquire the lock on next right
> > after switching.
> 
> OK, got it.  I think, anyway.
> 
> So preemptible RCU and context switches are handled as follows by
> your patch, correct?
> 
> 1.	The outgoing task is in an RCU read-side critical section
> 	due to the rcu_read_lock() that you added at the beginning
> 	of schedule().
> 
> 2.	In context_switch(), the lockdep state is set so that lockdep
> 	believes that the task has exited its RCU read-side critical
> 	section (thus suppressing the lockdep splat that would otherwise
> 	appear).
> 
> 3.	Once the context switch has happened, RCU no longer pays
> 	attention to the outgoing task.  So the outgoing task's
> 	current->rcu_read_lock_nesting still indicates that the task is in
> 	an RCU read-side critical section, but RCU does not know or care.
> 
> 4.	If the incoming task has done at least one prior context switch,
> 	its current->rcu_read_lock_nesting will indicate that it is
> 	already in an RCU read-side critical section.  Since there is
> 	no opportunity for RCU to report a quiescent state during the
> 	context switch itself (because holding the runqueue lock keeps
> 	interrupts disabled), RCU sees the whole context switch event
> 	as being one big RCU read-side critical section.
> 
> 	If interrupts were enabled at any point, RCU might see this as
> 	two back-to-back read-side critical sections.  This might be
> 	OK for all I know, but it would open the door more widely to
> 	heavyweight RCU processing at rcu_read_unlock() time.
> 
> 5.	Yes, context switch is a quiescent state, but that quiescent
> 	state happens when rcu_note_context_switch() is called from
> 	schedule, which is before the runqueue lock is acquired, and
> 	thus before the beginning of the RCU read-side critical section
> 	that covers the context switch.
> 
> 6.	In finish_lock_switch(), the rcu_read_acquire() restores
> 	lockdep state to match that of the incoming task.
> 
> 	But if the task is newly created, then its value of
> 	current->rcu_read_lock_nesting will be zero, which will leave
> 	the incoming task unprotected by RCU.  I believe that you
> 	need the (untested) patch below to make this work properly.
> 	But this patch needs to be included in your set, as I don't
> 	immediately see a reasonable way to make it work independently.
> 	(Yes, we could have a special CONFIG var for the transition,
> 	but it sounds a lot easier to just have it be part of your patch.)
> 
> 7.	Exit processing is done by the task itself before a last
> 	call to schedule().  From what I can see, it should not matter
> 	that the value of current->rcu_read_lock_nesting gets incremented
> 	on this last TASK_DEAD call to schedule().
> 
> Or am I still confused?
Hmmm...  Something is confused.  I get boot time hangs with the
occasional stack overflow, whether or not I add my patch on top.
Left to myself, I would try applying your patch incrementally, and
also disabling irqs before rcu_read_lock() and enabling them after
rcu_read_unlock(), as this prevents rcu_read_unlock() from ever getting
to the rcu_read_unlock_special() slowpath.
							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Powered by blists - more mailing lists
 
