Date:	Thu, 17 Jun 2010 21:16:08 -0700 (PDT)
From:	Sage Weil <sage@...dream.net>
To:	Neil Brown <neilb@...e.de>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Alexey Dobriyan <adobriyan@...il.com>,
	Tejun Heo <tj@...nel.org>, linux-kernel@...r.kernel.org,
	hch@....de, bfields@...i.umich.edu
Subject: Re: weird umem vs nfsd regression

On Fri, 18 Jun 2010, Neil Brown wrote:
> On Thu, 17 Jun 2010 13:47:06 -0700
> Andrew Morton <akpm@...ux-foundation.org> wrote:
> 
> > On Thu, 17 Jun 2010 13:21:36 -0700 (PDT)
> > Sage Weil <sage@...dream.net> wrote:
> > 
> > > Hi,
> > > 
> > > I started seeing a hang during bootup of a machine with a umem 
> > > (micromemory nvram) card.  I bisected it and narrowed it down to commit 
> > > b95a5680 which merged a couple nfsd changes, although strangely reverting 
> > > just b160fdab ('nfsd: nfsd_setattr needs to call commit_metadata') is 
> > > sufficient to make the problem go away.
> 
> There are two possible explanations for this.
> One is that the breakage is caused by an interaction between a commit in one
> branch below the merge, and a commit in the other branch.

The merge was only 2 commits.  I tried reverting both and b160fdab was the 
winner.

> The other is that the bug has nothing to do with the code, it is randomly
> caused by some hardware timing issue, and the "git bisect" process just led
> you on a random walk through git history.
> 
> It could be a combination of the two.  Some earlier commit makes the symptom
> possible but not probable.  'git bisect' would then lead you to some random
> commit after the causal commit.
> 
> Did you try each kernel multiple times?  Could you?

I just reverified 3x again, and the behavior is consistent.  v2.6.35-rc3 
fails, and reverting b160fdab does not.  I also did a make clean.

However, I took Andrew's suggestion and turned on some more random debug 
options, and with that .config v2.6.35-rc3 does not fail.  The diff is

4c4
< # Wed Jun 16 12:03:35 2010
---
> # Thu Jun 17 20:40:31 2010
2000c2000
< # CONFIG_DEBUG_SLAB_LEAK is not set
---
> CONFIG_DEBUG_SLAB_LEAK=y
2043c2043
< # CONFIG_DEBUG_PAGEALLOC is not set
---
> CONFIG_DEBUG_PAGEALLOC=y
2088c2088
< # CONFIG_DEBUG_PER_CPU_MAPS is not set
---
> CONFIG_DEBUG_PER_CPU_MAPS=y
2090c2090,2091
< # CONFIG_DEBUG_RODATA is not set
---
> CONFIG_DEBUG_RODATA=y
> CONFIG_DEBUG_RODATA_TEST=y
2092c2093
< # CONFIG_IOMMU_DEBUG is not set
---
> CONFIG_IOMMU_DEBUG=y
2104c2105
< # CONFIG_DEBUG_BOOT_PARAMS is not set
---
> CONFIG_DEBUG_BOOT_PARAMS=y
2264a2266
> # CONFIG_CPUMASK_OFFSTACK is not set

Both .config's are attached.
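
For comparing the two attached files by option rather than by line, 
something like the following could help.  It is only a rough sketch: the 
function names parse_config and config_delta are made up for illustration, 
and it assumes the usual .config layout of "CONFIG_FOO=value" lines plus 
"# CONFIG_FOO is not set" comments.  Other comment lines, such as the 
timestamp, are ignored.

```python
import re

# Assumed .config line shapes: "CONFIG_FOO=value" and "# CONFIG_FOO is not set".
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each CONFIG_ symbol to its value; 'n' stands for 'is not set'."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            opts[m.group(1)] = "n"
        # Anything else (timestamps, section comments, blanks) is skipped.
    return opts


def config_delta(old_text, new_text):
    """Return {symbol: (old_value, new_value)} for symbols that differ.

    A value of None means the symbol does not appear in that file at all,
    which is different from being explicitly 'not set'.
    """
    old, new = parse_config(old_text), parse_config(new_text)
    return {
        sym: (old.get(sym), new.get(sym))
        for sym in sorted(old.keys() | new.keys())
        if old.get(sym) != new.get(sym)
    }
```

Reading each attachment into a string and passing both to config_delta 
should give one entry per changed symbol, independent of line offsets.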

I wonder if zeroing out the memory (and clearing those umem errors) will 
change things, but once I do that I may not be able to reproduce this
hang.

I guess at this point I'll walk through those options one by one to see 
which one changes the behavior.
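
A rough sketch of how the one-by-one walk could be scripted: start from the 
failing .config and produce one variant per option, each with a single 
option switched on.  The helper names enable_option and one_at_a_time are 
hypothetical.  The candidate list is taken from the diff above, with 
CONFIG_DEBUG_RODATA_TEST left out on the assumption that it only follows 
CONFIG_DEBUG_RODATA.  Each variant would still need to go through 
"make oldconfig" so that dependent symbols get resolved.

```python
# Options that differ between the two configs, taken from the diff above.
CANDIDATES = [
    "CONFIG_DEBUG_SLAB_LEAK",
    "CONFIG_DEBUG_PAGEALLOC",
    "CONFIG_DEBUG_PER_CPU_MAPS",
    "CONFIG_DEBUG_RODATA",
    "CONFIG_IOMMU_DEBUG",
    "CONFIG_DEBUG_BOOT_PARAMS",
]


def enable_option(config_text, symbol):
    """Return a copy of config_text with `symbol` set to y."""
    unset = f"# {symbol} is not set"
    out, found = [], False
    for line in config_text.splitlines():
        # Match the exact symbol only; "SYMBOL=" avoids prefix collisions
        # such as CONFIG_DEBUG_RODATA vs CONFIG_DEBUG_RODATA_TEST.
        if line == unset or line.startswith(symbol + "="):
            out.append(f"{symbol}=y")
            found = True
        else:
            out.append(line)
    if not found:
        # Symbol was absent entirely; append it and let oldconfig sort it out.
        out.append(f"{symbol}=y")
    return "\n".join(out) + "\n"


def one_at_a_time(base_text, symbols=CANDIDATES):
    """Map each symbol to a config variant with only that symbol enabled."""
    return {sym: enable_option(base_text, sym) for sym in symbols}
```

Each variant differs from the failing config in exactly one option, so 
whichever build stops hanging points at the option that matters (unless it 
takes a combination of them).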

sage




> > > When it hangs, I see
> > > 
> > > [    1.069553] v2.3 : Micro Memory(tm) PCI memory board block driver
> > > [    1.075780] umem 0000:02:01.0: can't find IRQ for PCI INT C; probably buggy MP table
> > > [    1.083633] umem 0000:02:01.0: Micro Memory(tm) controller found (PCI Mem Module (Battery Backup))
> > > [    1.092762] umem 0000:02:01.0: CSR 0xfc9ffc00 -> 0xffffc90001466c00 (0x100)
> > > [    1.099880] umem 0000:02:01.0: Size 1048576 KB, Battery 1 Disabled (FAILURE), Battery 2 Disabled (FAILURE)
> > > [    1.109745] umem 0000:02:01.0: Window size 16777216 bytes, IRQ 9
> > > [    1.115842] umem 0000:02:01.0: memory NOT initialized. Consider over-writing whole device.
> > > [    1.125778]  umema:
> > > [  240.886560] INFO: task swapper:1 blocked for more than 120 seconds.
> > > [  240.893186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [  240.901115] swapper       D 0000000000000001     0     1      0 0x00000000
> > > [  240.908231]  ffff8800f8da1b50 0000000000000046 ffff8800f8da1ac0 ffff8800f8da1fd8
> > > [  240.916027]  ffff8800f8d80050 0000000000004000 0000000000004000 00000000001d2180
> > > [  240.923819]  00000000001d2180 ffff8800f8da1fd8 ffff8800f8da1fd8 00000000001d2180
> > > [  240.931611] Call Trace:
> > > [  240.934133]  [<ffffffff8144f96d>] ? _raw_spin_unlock_irqrestore+0x4c/0x68
> > > [  240.941011]  [<ffffffff812a99ab>] ? mm_unplug_device+0x47/0x50
> 
> It looks like umem is waiting for the DMA controller on the card to
> acknowledge that the DMA is complete.
> Maybe the interrupt is getting lost somehow?  Shared interrupt line?  I guess
> we would see some sort of "nobody cared" message if it were something like
> that.
> 
> 
> > > and with the above commit reverted, I get a 'normal' umem driver init (the 
> > > umem errors/warnings are normal.. the batteries aren't connected and the 
> > > card isn't being used):
> 
> Is this really a reliable effect?  Boot several times without the commit and
> it works.  Add the commit and boot several times and it always fails.   Then
> remove the commit again and it reliably works?
> 
> I don't like to have to ask such basic questions, but it is a really weird
> error so I need to be sure to eliminate any mismeasurement.
> 
> 
> 
> > There may also be bugs in the umem driver.  Even if the IO errors are
> > bogus, the kernel shouldn't hang up waiting for IO completion as it's
> > doing here.
> > 
> 
> No?  Even if it is due to faulty hardware?
> Do you think the driver should set a timer and disable the card if it hasn't
> heard back for a while?
> I guess that might be reasonable, though if it turns out to be faulty
> hardware I wouldn't trust it on the bus at all...
> 
> NeilBrown
> 
> 
View attachment ".config.umembetter" of type "TEXT/PLAIN" (60249 bytes)

View attachment ".config.umembad" of type "TEXT/PLAIN" (60249 bytes)
