Message-ID: <Pine.LNX.4.64.1006172109010.6914@cobra.newdream.net>
Date: Thu, 17 Jun 2010 21:16:08 -0700 (PDT)
From: Sage Weil <sage@...dream.net>
To: Neil Brown <neilb@...e.de>
cc: Andrew Morton <akpm@...ux-foundation.org>,
Alexey Dobriyan <adobriyan@...il.com>,
Tejun Heo <tj@...nel.org>, linux-kernel@...r.kernel.org,
hch@....de, bfields@...i.umich.edu
Subject: Re: weird umem vs nfsd regression
On Fri, 18 Jun 2010, Neil Brown wrote:
> On Thu, 17 Jun 2010 13:47:06 -0700
> Andrew Morton <akpm@...ux-foundation.org> wrote:
>
> > On Thu, 17 Jun 2010 13:21:36 -0700 (PDT)
> > Sage Weil <sage@...dream.net> wrote:
> >
> > > Hi,
> > >
> > > I started seeing a hang during bootup of a machine with a umem
> > > (micromemory nvram) card. I bisected it and narrowed it down to commit
> > > b95a5680 which merged a couple nfsd changes, although strangely reverting
> > > just b160fdab ('nfsd: nfsd_setattr needs to call commit_metadata') is
> > > sufficient to make the problem go away.
>
> There are two possible explanations for this.
> One is that the breakage is caused by an interaction between a commit in one
> branch below the merge, and a commit in the other branch.

The merge was only 2 commits. I tried reverting each of them, and b160fdab
was the winner.

> The other is that the bug has nothing to do with the code, it is randomly
> caused by some hardware timing issue, and the "git bisect" process just led
> you on a random walk through git history.
>
> It could be a combination of the two. Some earlier commit makes the symptom
> possible but not probable. 'git bisect' would then lead you to some random
> commit after the causal commit.
>
> Did you try each kernel multiple times? Could you?

I just reverified 3x again, and the behavior is consistent. v2.6.35-rc3
fails, and v2.6.35-rc3 with b160fdab reverted does not. I also did a make
clean.

However, I took Andrew's suggestion and turned on some more random debug
options, and with that .config v2.6.35-rc3 does not fail. The diff is

4c4
< # Wed Jun 16 12:03:35 2010
---
> # Thu Jun 17 20:40:31 2010
2000c2000
< # CONFIG_DEBUG_SLAB_LEAK is not set
---
> CONFIG_DEBUG_SLAB_LEAK=y
2043c2043
< # CONFIG_DEBUG_PAGEALLOC is not set
---
> CONFIG_DEBUG_PAGEALLOC=y
2088c2088
< # CONFIG_DEBUG_PER_CPU_MAPS is not set
---
> CONFIG_DEBUG_PER_CPU_MAPS=y
2090c2090,2091
< # CONFIG_DEBUG_RODATA is not set
---
> CONFIG_DEBUG_RODATA=y
> CONFIG_DEBUG_RODATA_TEST=y
2092c2093
< # CONFIG_IOMMU_DEBUG is not set
---
> CONFIG_IOMMU_DEBUG=y
2104c2105
< # CONFIG_DEBUG_BOOT_PARAMS is not set
---
> CONFIG_DEBUG_BOOT_PARAMS=y
2264a2266
> # CONFIG_CPUMASK_OFFSTACK is not set

Both .config's are attached.

I wonder if zeroing out the memory (and clearing those umem errors) will
change things, but once I do that I may not be able to reproduce this
hang.

I guess at this point I'll walk through those options one by one to see
which one changes the behavior.
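
As a rough sketch of that one-by-one walk (not something I have run against
these configs): the Python below parses two .config files, lists the options
that differ, and generates one trial config per differing option, each being
the failing config with a single option switched to its value in the working
config. The function names are made up for illustration. It ignores Kconfig
dependencies (CONFIG_DEBUG_RODATA_TEST depends on CONFIG_DEBUG_RODATA, for
instance), so each trial would still need a pass through "make oldconfig"
before building.

```python
import re

# Matches "CONFIG_FOO=y" and "# CONFIG_FOO is not set" lines.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value}; options marked 'is not set' map to 'n'."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def config_delta(bad, good):
    """Options whose value differs, as {option: (bad_value, good_value)}."""
    delta = {}
    for name in sorted(set(bad) | set(good)):
        b, g = bad.get(name), good.get(name)  # None: absent from that file
        if b != g:
            delta[name] = (b, g)
    return delta


def one_at_a_time(bad, good):
    """Yield (option, trial): the bad config with one option set as in good."""
    for name, (_, g) in config_delta(bad, good).items():
        trial = dict(bad)
        if g is None:
            trial.pop(name, None)
        else:
            trial[name] = g
        yield name, trial
```

Feeding it the contents of the two attached configs would give one trial per
differing option; whichever trial stops hanging points at the option that
masks the problem.
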
sage
> > > When it hangs, I see
> > >
> > > [ 1.069553] v2.3 : Micro Memory(tm) PCI memory board block driver
> > > [ 1.075780] umem 0000:02:01.0: can't find IRQ for PCI INT C; probably buggy MP table
> > > [ 1.083633] umem 0000:02:01.0: Micro Memory(tm) controller found (PCI Mem Module (Battery Backup))
> > > [ 1.092762] umem 0000:02:01.0: CSR 0xfc9ffc00 -> 0xffffc90001466c00 (0x100)
> > > [ 1.099880] umem 0000:02:01.0: Size 1048576 KB, Battery 1 Disabled (FAILURE), Battery 2 Disabled (FAILURE)
> > > [ 1.109745] umem 0000:02:01.0: Window size 16777216 bytes, IRQ 9
> > > [ 1.115842] umem 0000:02:01.0: memory NOT initialized. Consider over-writing whole device.
> > > [ 1.125778] umema:
> > > [ 240.886560] INFO: task swapper:1 blocked for more than 120 seconds.
> > > [ 240.893186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [ 240.901115] swapper D 0000000000000001 0 1 0 0x00000000
> > > [ 240.908231] ffff8800f8da1b50 0000000000000046 ffff8800f8da1ac0 ffff8800f8da1fd8
> > > [ 240.916027] ffff8800f8d80050 0000000000004000 0000000000004000 00000000001d2180
> > > [ 240.923819] 00000000001d2180 ffff8800f8da1fd8 ffff8800f8da1fd8 00000000001d2180
> > > [ 240.931611] Call Trace:
> > > [ 240.934133] [<ffffffff8144f96d>] ? _raw_spin_unlock_irqrestore+0x4c/0x68
> > > [ 240.941011] [<ffffffff812a99ab>] ? mm_unplug_device+0x47/0x50
>
> It looks like umem is waiting for the DMA controller on the card to
> acknowledge that the DMA is complete.
> Maybe the interrupt is getting lost somehow? Shared interrupt line? I guess
> we would see some sort of "nobody cared" message if it were something like
> that.
>
>
> > > and with the above commit reverted, I get a 'normal' umem driver init (the
> > > umem errors/warnings are normal.. the batteries aren't connected and the
> > > card isn't being used):
>
> Is this really a reliable effect? Boot several times without the commit and
> it works. Add the commit and boot several times and it always fails. Then
> remove the commit again and it reliably works?
>
> I don't like to have to ask such basic questions, but it is a really weird
> error so I need to be sure to eliminate any mismeasurement.
>
>
>
> > There may also be bugs in the umem driver. Even if the IO errors are
> > bogus, the kernel shouldn't hang up waiting for IO completion as it's
> > doing here.
> >
>
> No? Even if it is due to faulty hardware?
> Do you think the driver should set a timer and disable the card if it hasn't
> heard back for a while?
> I guess that might be reasonable, though if it turns out to be faulty
> hardware I wouldn't trust it on the bus at all...
>
> NeilBrown
>
>
View attachment ".config.umembetter" of type "TEXT/PLAIN" (60249 bytes)
View attachment ".config.umembad" of type "TEXT/PLAIN" (60249 bytes)