Date:	Thu, 5 Jan 2012 22:21:16 +0000
From:	Mel Gorman <mel@....ul.ie>
To:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:	Russell King - ARM Linux <linux@....linux.org.uk>,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	Gilad Ben-Yossef <gilad@...yossef.com>,
	linux-kernel@...r.kernel.org, Chris Metcalf <cmetcalf@...era.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Frederic Weisbecker <fweisbec@...il.com>, linux-mm@...ck.org,
	Pekka Enberg <penberg@...nel.org>,
	Matt Mackall <mpm@...enic.com>,
	Sasha Levin <levinsasha928@...il.com>,
	Rik van Riel <riel@...hat.com>,
	Andi Kleen <andi@...stfloor.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Greg KH <gregkh@...e.de>, linux-fsdevel@...r.kernel.org,
	Avi Kivity <avi@...hat.com>
Subject: Re: [PATCH v5 7/8] mm: Only IPI CPUs to drain local pages if they
 exist

(Adding Greg to cc to see if he recalls any recent reports of sysfs
dentries suffering from recursive locking)

On Thu, Jan 05, 2012 at 10:35:04AM -0800, Paul E. McKenney wrote:
> On Thu, Jan 05, 2012 at 04:35:29PM +0000, Russell King - ARM Linux wrote:
> > On Thu, Jan 05, 2012 at 04:17:39PM +0000, Mel Gorman wrote:
> > > Link please?
> > 
> > Forwarded, as it's still in my mailbox.
> > 
> > > I'm including a patch below, still under development, that is intended
> > > only to cope with the page allocator case under heavy memory pressure.
> > > Currently it does not pass testing because RCU eventually gets stalled
> > > with the following trace:
> > > 
> > > [ 1817.176001]  [<ffffffff810214d7>] arch_trigger_all_cpu_backtrace+0x87/0xa0
> > > [ 1817.176001]  [<ffffffff810c4779>] __rcu_pending+0x149/0x260
> > > [ 1817.176001]  [<ffffffff810c48ef>] rcu_check_callbacks+0x5f/0x110
> > > [ 1817.176001]  [<ffffffff81068d7f>] update_process_times+0x3f/0x80
> > > [ 1817.176001]  [<ffffffff8108c4eb>] tick_sched_timer+0x5b/0xc0
> > > [ 1817.176001]  [<ffffffff8107f28e>] __run_hrtimer+0xbe/0x1a0
> > > [ 1817.176001]  [<ffffffff8107f581>] hrtimer_interrupt+0xc1/0x1e0
> > > [ 1817.176001]  [<ffffffff81020ef3>] smp_apic_timer_interrupt+0x63/0xa0
> > > [ 1817.176001]  [<ffffffff81449073>] apic_timer_interrupt+0x13/0x20
> > > [ 1817.176001]  [<ffffffff8116c135>] vfsmount_lock_local_lock+0x25/0x30
> > > [ 1817.176001]  [<ffffffff8115c855>] path_init+0x2d5/0x370
> > > [ 1817.176001]  [<ffffffff8115eecd>] path_lookupat+0x2d/0x620
> > > [ 1817.176001]  [<ffffffff8115f4ef>] do_path_lookup+0x2f/0xd0
> > > [ 1817.176001]  [<ffffffff811602af>] user_path_at_empty+0x9f/0xd0
> > > [ 1817.176001]  [<ffffffff81154e7b>] vfs_fstatat+0x4b/0x90
> > > [ 1817.176001]  [<ffffffff81154f4f>] sys_newlstat+0x1f/0x50
> > > [ 1817.176001]  [<ffffffff81448692>] system_call_fastpath+0x16/0x1b
> > > 
> > > It might be a separate bug; I don't know for sure.
> 

I rebased the patch on top of 3.2 and tested again with a bunch of
debugging options set (PROVE_RCU, PROVE_LOCKING, etc.). Same results:
CPU hotplug is a lot more reliable and less likely to hang, but it
eventually gets into trouble.

Taking a closer look though, I don't think this is an RCU problem;
RCU is just the messenger.

> Do you get multiple RCU CPU stall-warning messages? 

Yes, a pair (one for CPU 3, one for CPU 6) roughly every 48780 jiffies,
i.e. about every 195 seconds at HZ=250:

[  878.315029] INFO: rcu_sched detected stall on CPU 3 (t=16250 jiffies)
[  878.315032] INFO: rcu_sched detected stall on CPU 6 (t=16250 jiffies)
[ 1072.878669] INFO: rcu_sched detected stall on CPU 3 (t=65030 jiffies)
[ 1072.878672] INFO: rcu_sched detected stall on CPU 6 (t=65030 jiffies)
[ 1267.442308] INFO: rcu_sched detected stall on CPU 3 (t=113810 jiffies)
[ 1267.442312] INFO: rcu_sched detected stall on CPU 6 (t=113810 jiffies)
[ 1462.005948] INFO: rcu_sched detected stall on CPU 3 (t=162590 jiffies)
[ 1462.005952] INFO: rcu_sched detected stall on CPU 6 (t=162590 jiffies)
[ 1656.569588] INFO: rcu_sched detected stall on CPU 3 (t=211370 jiffies)
[ 1656.569592] INFO: rcu_sched detected stall on CPU 6 (t=211370 jiffies)
[ 1851.133229] INFO: rcu_sched detected stall on CPU 6 (t=260150 jiffies)
[ 1851.133233] INFO: rcu_sched detected stall on CPU 3 (t=260150 jiffies)
[ 2045.696868] INFO: rcu_sched detected stall on CPU 3 (t=308930 jiffies)
[ 2045.696872] INFO: rcu_sched detected stall on CPU 6 (t=308930 jiffies)
[ 2240.260508] INFO: rcu_sched detected stall on CPU 6 (t=357710 jiffies)
[ 2240.260511] INFO: rcu_sched detected stall on CPU 3 (t=357710 jiffies)

> If so, it can
> be helpful to look at how the stack frame changes over time.  These
> stalls are normally caused by a loop in the kernel with preemption
> disabled, though other scenarios can also cause them.
> 
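
For reference, the canonical pattern being described here is just a busy
loop with preemption disabled. A toy illustration only, not code from the
patch under test ("some_flag" is a made-up stand-in):

        /* toy illustration only, not from the patch under test */
        preempt_disable();
        while (!some_flag)      /* condition that never becomes true */
                cpu_relax();    /* never yields, no quiescent state */
        preempt_enable();

That is effectively what the traces below show: do_raw_spin_lock()
spinning on a d_lock that is never released, with preemption already
disabled by spin_lock().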

The stacks are not changing much over time and start with this:

[  878.315029] INFO: rcu_sched detected stall on CPU 3 (t=16250 jiffies)
[  878.315032] INFO: rcu_sched detected stall on CPU 6 (t=16250 jiffies)
[  878.315036] Pid: 4422, comm: udevd Not tainted 3.2.0-guardipi-v1r6 #2
[  878.315037] Call Trace:
[  878.315038]  <IRQ>  [<ffffffff810a8b20>] __rcu_pending+0x8e/0x36c
[  878.315052]  [<ffffffff81071b9a>] ? tick_nohz_handler+0xdc/0xdc
[  878.315054]  [<ffffffff810a8f04>] rcu_check_callbacks+0x106/0x172
[  878.315056]  [<ffffffff810528e0>] update_process_times+0x3f/0x76
[  878.315058]  [<ffffffff81071c0a>] tick_sched_timer+0x70/0x9a
[  878.315060]  [<ffffffff8106654e>] __run_hrtimer+0xc7/0x157
[  878.315062]  [<ffffffff810667ec>] hrtimer_interrupt+0xba/0x18a
[  878.315065]  [<ffffffff8134fbad>] smp_apic_timer_interrupt+0x86/0x99
[  878.315067]  [<ffffffff8134dbf3>] apic_timer_interrupt+0x73/0x80
[  878.315068]  <EOI>  [<ffffffff81345f34>] ? retint_restore_args+0x13/0x13
[  878.315072]  [<ffffffff81139591>] ? __shrink_dcache_sb+0x7d/0x19f
[  878.315075]  [<ffffffff81008c6e>] ? native_read_tsc+0x1/0x16
[  878.315077]  [<ffffffff811df434>] ? delay_tsc+0x3a/0x82
[  878.315079]  [<ffffffff811df4a1>] __delay+0xf/0x11
[  878.315081]  [<ffffffff811e51e5>] do_raw_spin_lock+0xb5/0xf9
[  878.315083]  [<ffffffff81345561>] _raw_spin_lock+0x39/0x3d
[  878.315085]  [<ffffffff8113972a>] ? shrink_dcache_parent+0x77/0x28c
[  878.315087]  [<ffffffff8113972a>] shrink_dcache_parent+0x77/0x28c
[  878.315089]  [<ffffffff8113741d>] ? have_submounts+0x13e/0x1bd
[  878.315092]  [<ffffffff81185970>] sysfs_dentry_revalidate+0xaa/0xbe
[  878.315093]  [<ffffffff8112e731>] do_lookup+0x263/0x2fc
[  878.315096]  [<ffffffff8119ca13>] ? security_inode_permission+0x1e/0x20
[  878.315098]  [<ffffffff8112f33d>] link_path_walk+0x1e2/0x763
[  878.315099]  [<ffffffff8112fd66>] path_lookupat+0x5c/0x61a
[  878.315102]  [<ffffffff810f4810>] ? might_fault+0x89/0x8d
[  878.315104]  [<ffffffff810f47c7>] ? might_fault+0x40/0x8d
[  878.315105]  [<ffffffff8113034e>] do_path_lookup+0x2a/0xa8
[  878.315107]  [<ffffffff81132a51>] user_path_at_empty+0x5d/0x97
[  878.315109]  [<ffffffff8107447f>] ? trace_hardirqs_off+0xd/0xf
[  878.315111]  [<ffffffff81345c4f>] ? _raw_spin_unlock_irqrestore+0x44/0x5a
[  878.315112]  [<ffffffff81132a9c>] user_path_at+0x11/0x13
[  878.315115]  [<ffffffff81128b64>] vfs_fstatat+0x44/0x71
[  878.315117]  [<ffffffff81128bef>] vfs_lstat+0x1e/0x20
[  878.315118]  [<ffffffff81128c10>] sys_newlstat+0x1f/0x40
[  878.315120]  [<ffffffff810759a8>] ? trace_hardirqs_on_caller+0x12d/0x164
[  878.315122]  [<ffffffff811e057e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  878.315124]  [<ffffffff8107447f>] ? trace_hardirqs_off+0xd/0xf
[  878.315126]  [<ffffffff8134d082>] system_call_fastpath+0x16/0x1b
[  878.557790] Pid: 5704, comm: udevd Not tainted 3.2.0-guardipi-v1r6 #2
[  878.564226] Call Trace:
[  878.566677]  <IRQ>  [<ffffffff810a8b20>] __rcu_pending+0x8e/0x36c
[  878.572783]  [<ffffffff81071b9a>] ? tick_nohz_handler+0xdc/0xdc
[  878.578702]  [<ffffffff810a8f04>] rcu_check_callbacks+0x106/0x172
[  878.584794]  [<ffffffff810528e0>] update_process_times+0x3f/0x76
[  878.590798]  [<ffffffff81071c0a>] tick_sched_timer+0x70/0x9a
[  878.596459]  [<ffffffff8106654e>] __run_hrtimer+0xc7/0x157
[  878.601944]  [<ffffffff810667ec>] hrtimer_interrupt+0xba/0x18a
[  878.607778]  [<ffffffff8134fbad>] smp_apic_timer_interrupt+0x86/0x99
[  878.614129]  [<ffffffff8134dbf3>] apic_timer_interrupt+0x73/0x80
[  878.620134]  <EOI>  [<ffffffff81051e66>] ? run_timer_softirq+0x49/0x32a
[  878.626759]  [<ffffffff81139591>] ? __shrink_dcache_sb+0x7d/0x19f
[  878.632851]  [<ffffffff811df402>] ? delay_tsc+0x8/0x82
[  878.637988]  [<ffffffff811df4a1>] __delay+0xf/0x11
[  878.642778]  [<ffffffff811e51e5>] do_raw_spin_lock+0xb5/0xf9
[  878.648437]  [<ffffffff81345561>] _raw_spin_lock+0x39/0x3d
[  878.653920]  [<ffffffff8113972a>] ? shrink_dcache_parent+0x77/0x28c
[  878.660186]  [<ffffffff8113972a>] shrink_dcache_parent+0x77/0x28c
[  878.666277]  [<ffffffff8113741d>] ? have_submounts+0x13e/0x1bd
[  878.672107]  [<ffffffff81185970>] sysfs_dentry_revalidate+0xaa/0xbe
[  878.678372]  [<ffffffff8112e731>] do_lookup+0x263/0x2fc
[  878.683596]  [<ffffffff8119ca13>] ? security_inode_permission+0x1e/0x20
[  878.690207]  [<ffffffff8112f33d>] link_path_walk+0x1e2/0x763
[  878.695866]  [<ffffffff8112fd66>] path_lookupat+0x5c/0x61a
[  878.701350]  [<ffffffff810f4810>] ? might_fault+0x89/0x8d
[  878.706747]  [<ffffffff810f47c7>] ? might_fault+0x40/0x8d
[  878.712145]  [<ffffffff8113034e>] do_path_lookup+0x2a/0xa8
[  878.717630]  [<ffffffff81132a51>] user_path_at_empty+0x5d/0x97
[  878.723463]  [<ffffffff8107447f>] ? trace_hardirqs_off+0xd/0xf
[  878.729295]  [<ffffffff81345c4f>] ? _raw_spin_unlock_irqrestore+0x44/0x5a
[  878.736080]  [<ffffffff81132a9c>] user_path_at+0x11/0x13
[  878.741391]  [<ffffffff81128b64>] vfs_fstatat+0x44/0x71
[  878.746616]  [<ffffffff81128bef>] vfs_lstat+0x1e/0x20
[  878.751668]  [<ffffffff81128c10>] sys_newlstat+0x1f/0x40
[  878.756981]  [<ffffffff810759a8>] ? trace_hardirqs_on_caller+0x12d/0x164
[  878.763678]  [<ffffffff811e057e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  878.770116]  [<ffffffff8107447f>] ? trace_hardirqs_off+0xd/0xf
[  878.775949]  [<ffffffff8134d082>] system_call_fastpath+0x16/0x1b
[  908.769486] BUG: spinlock lockup on CPU#6, udevd/4422
[  908.774547]  lock: ffff8803b4c701c8, .magic: dead4ead, .owner: udevd/5709, .owner_cpu: 4

Seeing that the owner was running on CPU 4, I found this earlier in the log:

[  815.244051] BUG: spinlock lockup on CPU#4, udevd/5709
[  815.249103]  lock: ffff8803b4c701c8, .magic: dead4ead, .owner: udevd/5709, .owner_cpu: 4
[  815.258430] Pid: 5709, comm: udevd Not tainted 3.2.0-guardipi-v1r6 #2
[  815.264866] Call Trace:
[  815.267329]  [<ffffffff811e507d>] spin_dump+0x88/0x8d
[  815.272388]  [<ffffffff811e5206>] do_raw_spin_lock+0xd6/0xf9
[  815.278062]  [<ffffffff81345561>] ? _raw_spin_lock+0x39/0x3d
[  815.283720]  [<ffffffff8113972a>] ? shrink_dcache_parent+0x77/0x28c
[  815.289986]  [<ffffffff8113972a>] ? shrink_dcache_parent+0x77/0x28c
[  815.296249]  [<ffffffff8113741d>] ? have_submounts+0x13e/0x1bd
[  815.302080]  [<ffffffff81185970>] ? sysfs_dentry_revalidate+0xaa/0xbe
[  815.308515]  [<ffffffff8112e731>] ? do_lookup+0x263/0x2fc
[  815.313915]  [<ffffffff8119ca13>] ? security_inode_permission+0x1e/0x20
[  815.320524]  [<ffffffff8112f33d>] ? link_path_walk+0x1e2/0x763
[  815.326357]  [<ffffffff8112fd66>] ? path_lookupat+0x5c/0x61a
[  815.332014]  [<ffffffff810f4810>] ? might_fault+0x89/0x8d
[  815.337410]  [<ffffffff810f47c7>] ? might_fault+0x40/0x8d
[  815.342807]  [<ffffffff8113034e>] ? do_path_lookup+0x2a/0xa8
[  815.348465]  [<ffffffff81132a51>] ? user_path_at_empty+0x5d/0x97
[  815.354474]  [<ffffffff8107447f>] ? trace_hardirqs_off+0xd/0xf
[  815.360303]  [<ffffffff81345c4f>] ? _raw_spin_unlock_irqrestore+0x44/0x5a
[  815.367085]  [<ffffffff81132a9c>] ? user_path_at+0x11/0x13
[  815.372569]  [<ffffffff81128b64>] ? vfs_fstatat+0x44/0x71
[  815.377965]  [<ffffffff81128bef>] ? vfs_lstat+0x1e/0x20
[  815.383192]  [<ffffffff81128c10>] ? sys_newlstat+0x1f/0x40
[  815.388676]  [<ffffffff810759a8>] ? trace_hardirqs_on_caller+0x12d/0x164
[  815.395373]  [<ffffffff811e057e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  815.401811]  [<ffffffff8107447f>] ? trace_hardirqs_off+0xd/0xf
[  815.407642]  [<ffffffff8134d082>] ? system_call_fastpath+0x16/0x1b

The trace is not particularly useful, but it looks like the lock was
taken recursively even though the message doesn't say so explicitly.
If the shrink_dcache_parent() entry is accurate, it corresponds to this:
static int select_parent(struct dentry * parent)
{
        struct dentry *this_parent;
        struct list_head *next;
        unsigned seq;
        int found = 0;
        int locked = 0;

        seq = read_seqbegin(&rename_lock);
again: 
        this_parent = parent;
        spin_lock(&this_parent->d_lock); <----- HERE
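        /*
         * d_lock is an ordinary, non-recursive spinlock, so if the caller
         * already holds parent->d_lock this spin_lock() never succeeds.
         * With CONFIG_DEBUG_SPINLOCK that shows up as the "spinlock lockup"
         * reports above; the earlier one is the interesting case, where the
         * spinning task (udevd/5709 on CPU 4) is also the recorded owner.
         */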

I'm not overly clear on how VFS locking is meant to work, but it almost
looks as if the last reference to an inode is being dropped during a
sysfs path lookup. Is that meant to happen?

Judging by sysfs_dentry_revalidate() - possibly not. It looks like
we must have reached out_bad: and called shrink_dcache_parent() on a
dentry that was already locked by the running process. Not sure how
this could have happened - Greg, does this look familiar?
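
For reference, the tail of sysfs_dentry_revalidate() in the 3.2-era
fs/sysfs/dir.c is roughly the following (paraphrased from memory, so
treat it as a sketch rather than verbatim source):

out_bad:
        /* The dentry is stale: the sysfs_dirent was removed, moved or
         * renamed, so drop it from the dcache. */
        is_dir = (sysfs_type(sd) == SYSFS_DIR);
        mutex_unlock(&sysfs_mutex);
        if (is_dir) {
                /* Cannot drop a dentry that has submounts under it */
                if (have_submounts(dentry))
                        goto out_valid;
                shrink_dcache_parent(dentry);
        }
        d_drop(dentry);
        return 0;

That matches the have_submounts()/shrink_dcache_parent() frames in the
traces above; the open question is how we get here while the same task
already holds a d_lock that select_parent() then tries to take again.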

-- 
Mel Gorman
SUSE Labs