Message-ID: <20080809135650.GE8125@linux.vnet.ibm.com>
Date: Sat, 9 Aug 2008 06:56:50 -0700
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: David Witbrodt <dawitbro@...global.net>
Cc: Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Yinghai Lu <yhlu.kernel@...il.com>,
Ingo Molnar <mingo@...e.hu>,
Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>, netdev <netdev@...r.kernel.org>
Subject: Re: HPET regression in 2.6.26 versus 2.6.25 -- RCU problem

On Sat, Aug 09, 2008 at 05:39:26AM -0700, David Witbrodt wrote:
>
>
> > On Fri, 2008-08-08 at 18:23 -0700, David Witbrodt wrote:
> > > I have tracked the regression down to an RCU problem.
> > > [...]
> > > After reading some documentation in Documentation/RCU/, it looks like
> > > something is misusing RCU -- and, according to the Documentation, those kinds
> > > of mistakes are easy to make. Maybe necessary calls to
> > >
> > > rcu_read_lock()
> > > rcu_read_unlock()
> > >
> > > are missing, and something about my hardware is triggering a freeze that
> > > doesn't occur on most hardware.
> > >
> > >
> > > For some reason, turning off the HPET by booting with "hpet=disabled" keeps
> > > the freeze from happening. Just reading a couple of those docs about RCU
> > > made me dizzy, so I hope someone familiar with RCU issues will take a look
> > > at the code in the files I've listed. Surely you guys can take it from here
> > > now?!
> > >
> > > If not, just give me some experimental code changes to make to get my 2.6.26
> > > and 2.6.27 kernels working again without disabling HPET!!!
> >
> >
> > The typical way to deadlock like this is to do something like:
> >
> > rcu_read_lock();
> >
> > synchronize_rcu();
> >
> > rcu_read_unlock();
> >
> > While I cannot immediately see any such usage in the function you
> > quoted, it could be one of the callers.. let me browse some code..
> >
> > Can't seem to find anything like that.
> >
> > What's weird though - is that HPET makes any difference on these network
> > code paths.
> >
> > Could we end up calling rcu too soon? I doubt we bring up ipv4 before
> > rcu..
>
> I'm _way_ over my head in this discussion, but here's some more food
> for thought. Last weekend, when I first tried 2.6.26 and discovered the
> freeze, I thought an error of my own in .config was causing it. Before
> I ever sought help, I made about a dozen experiments with different
> .config files.
>
> One series of those experiments involved turning off most of the kernel...
> including CONFIG_INET. The kernel still froze, but when entering
> pci_init(). (This info can be read in my original post to the Debian BTS,
> which I have provided links for a couple of times in this LKML thread. I
> even went further and removed enough that the freeze was avoided, but so
> much of the kernel was missing that my init scripts couldn't mount a hard
> disk any more. Trying to restore enough to allow HD mounting just brought
> back the freeze.)
>
> I am completely ignorant about how the kernel works, so any guesses I have
> are probably worthless... but I'll throw some out anyway:
>
> 1. Maybe HPET is used (if present) for timing by RCU, so disabling it
> forces RCU to work differently. (Pure guess here: I know nothing about
> RCU, and haven't even tried looking at its code.)

RCU doesn't use HPET directly. Most of its time-dependent behavior
comes from its being invoked from the scheduling-clock interrupt.

> 2. Maybe my hardware is broken. We did see one initcall return that
> reported over 280,000 msecs... when the entire boot->freeze time was about
> 3 secs. On the other hand, 2.6.25 (and before) work just fine with HPET
> enabled.

For CONFIG_CLASSIC_RCU and !CONFIG_PREEMPT, in-kernel infinite spin loops
will cause synchronize_rcu() to hang. For other RCU configurations,
spinning with interrupts disabled will result in similar hangs. Invoking
synchronize_rcu() very early in boot (before rcu_init() has been called)
will of course also hang.

Could you please let me know whether your config has CONFIG_CLASSIC_RCU
or CONFIG_PREEMPT_RCU?
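
As a rough illustration of the hang conditions described above, here is a
user-space toy model in C. It is not kernel code and not how RCU is
actually implemented; it assumes a simplified picture of Classic RCU in
which a grace period ends only after every CPU has reported a quiescent
state. All names (toy_rcu_state, TOY_NR_CPUS, and the toy_* functions)
are hypothetical and exist only for this sketch.

```c
/* Toy user-space model of a grace period; NOT the kernel's implementation. */
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* One flag per CPU: true while the grace period still waits on that CPU. */
struct toy_rcu_state {
	bool need_qs[TOY_NR_CPUS];
};

/* Begin a grace period: every CPU must now report a quiescent state. */
static void toy_start_grace_period(struct toy_rcu_state *s)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		s->need_qs[cpu] = true;
}

/*
 * Record a quiescent state for one CPU.  In this simplified model a CPU
 * would report one from a context switch, the idle loop, or user mode.
 * A CPU that spins forever in the kernel never gets here.
 */
static void toy_note_quiescent_state(struct toy_rcu_state *s, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	s->need_qs[cpu] = false;
}

/* Return the first CPU the grace period is still waiting on, or -1. */
static int toy_first_holdout_cpu(const struct toy_rcu_state *s)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (s->need_qs[cpu])
			return cpu;
	return -1;
}

/* The grace period is over only when no CPU is still a holdout. */
static bool toy_grace_period_done(const struct toy_rcu_state *s)
{
	return toy_first_holdout_cpu(s) == -1;
}
```

In this model, if one CPU spins forever and never calls
toy_note_quiescent_state(), toy_grace_period_done() stays false and
toy_first_holdout_cpu() keeps returning that CPU's number, so anything
waiting for the grace period would wait forever.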

> 3. I was able to find the commit that introduced the freeze
> (3def3d6ddf43dbe20c00c3cbc38dfacc8586998f), so there has to be a connection
> between that commit and the RCU problem. Is it possible that a preexisting
> error or oversight in the code was merely exposed by that commit? (And
> only on certain hardware?) Or does that code itself contain the error?

Thank you for finding the commit -- should be quite helpful!!!

A quick look reveals what appears to be reader-writer locking rather
than RCU. It does run in early boot before rcu_init(), so if it managed
to call synchronize_rcu() somehow you indeed would see a hang. I do
not see such a call, but then again, I don't know this code much at all.

This is the second time in as many days that something has motivated
RCU's working correctly before rcu_init()... Hmmm...

> 4. Another bug has been posted on the Debian BTS, which is worked around
> by disabling HPET. The user provided some links to bugzilla.kernel.org
> where David Brownell is fighting with some HPET/RTC issues (but no mention
> of RCU):
> http://bugzilla.kernel.org/show_bug.cgi?id=11111
> http://bugzilla.kernel.org/show_bug.cgi?id=11153
>
> I honestly don't know whether this is related to my problem or not. :-(

Nor me.

> If anyone has any test code I can run to detect massive HPET breakage on
> these motherboards, I'll be glad to do so. Or any other experimental
> code changes, for that matter.

If you can answer my CONFIG_CLASSIC_RCU vs. CONFIG_PREEMPT_RCU question
above, I should be able to provide you a diagnostic patch that would say
which CPU RCU was waiting on. At least assuming that at least one CPU
was still taking the scheduling-clock interrupt, that is. ;-)

Thanx, Paul