Message-ID: <20140526142948.GA1685@lahna.fi.intel.com>
Date: Mon, 26 May 2014 17:29:48 +0300
From: Mika Westerberg <mika.westerberg@...ux.intel.com>
To: Al Viro <viro@...IV.linux.org.uk>
Cc: linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Miklos Szeredi <mszeredi@...e.cz>,
linux-fsdevel@...r.kernel.org
Subject: Re: fs/dcache.c - BUG: soft lockup - CPU#5 stuck for 22s!
[systemd-udevd:1667]
On Mon, May 26, 2014 at 02:57:47PM +0100, Al Viro wrote:
> [fsdevel and folks who'd been on d_lru corruption thread Cc'd - that's
> a continuation of the same mess]
>
> On Mon, May 26, 2014 at 12:37:41PM +0300, Mika Westerberg wrote:
> > Hi,
> >
> > Since v3.15-rc4, my Fedora 20 system running a mainline kernel has been
> > suffering from the above lockup.
> >
> > This is easy to reproduce:
> >
> > 1) Plug in USB memory stick (to xHCI port)
> > 2) Unplug it
> >
> > Typically only one iteration is needed; then I can see systemd-udevd
> > taking 100% CPU, and eventually the whole system becomes unusable.
> >
> > I've tried to investigate and it looks like we just spin in
> > shrink_dentry_list() forever. After reverting the following fs/dcache.c
> > commits, the issue goes away:
> >
> > 60942f2f235ce7b817166cdf355eed729094834d dcache: don't need rcu in shrink_dentry_list()
> > 9c8c10e262e0f62cb2530f1b076de979123183dd more graceful recovery in umount_collect()
> > fe91522a7ba82ca1a51b07e19954b3825e4aaa22 don't remove from shrink list in select_collect()
>
> Which means that we very likely have a reproducer for d_lru-corrupting
> races in earlier kernels here. I wonder if it can be simulated under KVM...
>
> > (The first two commits don't seem to be related themselves, but reverting
> > them is needed so that the last one can be reverted cleanly).
>
> What I really wonder is what else is going on there; it keeps finding a bunch
> of dentries _already_ on somebody else's shrink list(s). And it spins (with
> eviction of everything worth evicting that isn't already on a shrink list,
> and cond_resched() thrown in) to let whoever's trying to evict those suckers
> do their job.
>
> This means that we either have somebody stuck trying to evict a dentry, or
> that more and more dentries keep being added and evicted there. Is somebody
> sitting in a subdirectory of the invalidated one and trying to do lookups
> there, perhaps? But in that case we would have the same livelock in the older
> kernels, possibly harder to hit, but still there...
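
For context, here is how I understand the pattern that spins, written as a
self-contained userspace sketch rather than the actual fs/dcache.c code (the
struct, the flags and the helpers below are invented purely for illustration):
a dentry that is already on somebody else's shrink list is counted as "found"
but cannot be disposed of locally, so the outer loop keeps retrying until the
other owner evicts it, and if that never happens we spin forever.

/*
 * Userspace sketch of the retry pattern described above.  NOT the real
 * fs/dcache.c code: names and list handling are made up.
 */
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_dentry {
	bool on_other_shrink_list;	/* like DCACHE_SHRINK_LIST held by someone else */
	bool disposable;		/* unreferenced; we may evict it ourselves */
};

/* One pass over the subtree: evict what we can, count everything we found. */
static int select_and_dispose(struct fake_dentry *d, int n)
{
	int found = 0;

	for (int i = 0; i < n; i++) {
		if (d[i].on_other_shrink_list) {
			found++;			/* can't touch it, but work remains */
		} else if (d[i].disposable) {
			d[i].disposable = false;	/* "evict" it ourselves */
			found++;
		}
	}
	return found;
}

int main(void)
{
	struct fake_dentry dentries[] = {
		{ .disposable = true },
		{ .on_other_shrink_list = true },	/* the other owner never evicts this one */
		{ .disposable = true },
	};

	/* Analogue of the "retry until nothing is found" outer loop. */
	for (;;) {
		int found = select_and_dispose(dentries, 3);

		if (!found)
			break;
		sched_yield();	/* stand-in for cond_resched() */
		/*
		 * If entry 1 never leaves the other shrink list, found stays
		 * nonzero and we livelock right here.
		 */
	}
	printf("done\n");	/* never reached with the setup above */
	return 0;
}

If that is roughly what the new select_collect()/shrink_dcache_parent()
combination does, then the question is indeed who owns the shrink list those
dentries sit on and why it never gets drained.
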
Not sure if this helps, but I ran 'udevadm monitor'; here is the list of
events from when I unplugged the USB stick:
KERNEL[134.584296] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0/host15/target15:0:0/15:0:0:0/bsg/15:0:0:0 (bsg)
KERNEL[134.584442] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0/host15/target15:0:0/15:0:0:0/scsi_generic/sg4 (scsi_generic)
KERNEL[134.584501] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0/host15/target15:0:0/15:0:0:0/scsi_device/15:0:0:0 (scsi_device)
KERNEL[134.584572] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0/host15/target15:0:0/15:0:0:0/scsi_disk/15:0:0:0 (scsi_disk)
KERNEL[134.584773] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0/host15/target15:0:0/15:0:0:0/block/sdc/sdc1 (block)
KERNEL[134.584914] remove /devices/virtual/bdi/8:32 (bdi)
KERNEL[134.586116] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0/host15/target15:0:0/15:0:0:0/block/sdc (block)
KERNEL[134.586176] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0/host15/target15:0:0/15:0:0:0 (scsi)
KERNEL[134.586203] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0/host15/target15:0:0 (scsi)
KERNEL[134.586225] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0/host15/scsi_host/host15 (scsi_host)
KERNEL[134.586249] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0/host15 (scsi)
KERNEL[134.586284] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3/3-10.3:1.0 (usb)
KERNEL[134.586329] remove /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10.3 (usb)
UDEV [134.589069] remove /devices/virtual/bdi/8:32 (bdi)
Right after that, systemd-udevd starts consuming 100% CPU, according to
htop and top.
> FWIW, older kernels just went ahead, picked those already-on-shrink-list
> dentries and did dentry_kill(), hopefully not at the time when the owner of
> the shrink list got around to removing the neighbor from that list, with
> list corruption if it happened at just the wrong moment.
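
To make sure I follow the corruption scenario: below is a tiny userspace
sketch (again not kernel code; the list type and the forced interleaving are
made up) of how an unsynchronized removal racing with the list owner's removal
of a neighboring entry leaves the prev/next pointers inconsistent.

/*
 * Userspace sketch of two unsynchronized list removals corrupting a doubly
 * linked list.  The two contexts are simulated sequentially at exactly the
 * bad interleaving instead of with real threads.
 */
#include <stdio.h>

struct list_node {
	struct list_node *prev, *next;
	const char *name;
};

static void list_del_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

int main(void)
{
	struct list_node head = { .name = "head" };
	struct list_node a = { .name = "A" }, b = { .name = "B" };

	/* Build the circular list: head <-> A <-> B <-> head. */
	head.next = &a; a.prev = &head;
	a.next = &b;    b.prev = &a;
	b.next = &head; head.prev = &b;

	/*
	 * Context 2 (the old-style dentry killer) starts deleting B: it has
	 * loaded B->prev and B->next but has not written anything yet.
	 */
	struct list_node *stale_prev = b.prev;	/* == &a */
	struct list_node *stale_next = b.next;	/* == &head */

	/* Context 1 (the list's owner) now removes A, B's neighbor. */
	list_del_node(&a);			/* list is head <-> B <-> head */

	/* Context 2 resumes and finishes its delete with the stale pointers. */
	stale_prev->next = stale_next;		/* writes into the removed node A */
	stale_next->prev = stale_prev;		/* head->prev now points at A! */

	printf("head.next = %s (B was supposed to be deleted)\n", head.next->name);
	printf("head.prev = %s (a node that is no longer on the list)\n", head.prev->name);
	return 0;
}

After the bad interleaving, forward and backward traversal of the list no
longer agree, which is the kind of corruption you would only ever see when
the timing is exactly wrong.
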
>
> I don't have Fedora anywhere outside of KVM test images, and it'll take
> a while to inflict it on actual hardware; in the meanwhile, could you
> hit alt-sysrq-t after it gets stuck and post the results? At least that
> would give some idea whether it's somebody stuck on trying to evict a dentry
> or a stream of new dentries being added and killed there.
I've attached the dmesg output with the 'echo t > /proc/sysrq-trigger'
(sysrq-t) dump included.
View attachment "dmesg.out" of type "text/plain" (473596 bytes)