Message-ID: <20160518083150.GB6551@dhcp12-144.nay.redhat.com>
Date:	Wed, 18 May 2016 16:31:50 +0800
From:	Xiong Zhou <xzhou@...hat.com>
To:	Dave Chinner <david@...morbit.com>
Cc:	Xiong Zhou <xzhou@...hat.com>, linux-next@...r.kernel.org,
	viro@...iv.linux.org.uk, linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org
Subject: Re: Linux-next parallel cp workload hang

Hi,

On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:
> On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> > Hi,
> > 
> > Parallel cp workload (xfstests generic/273) hangs like below.
> > It's reproducible with a small chance, less than 1/100 I think.
> > 
> > I have hit this in the linux-next 20160504, 0506 and 0510 trees, testing on
> > XFS with a loop or block device. Ext4 survived several rounds
> > of testing.
> > 
> > The linux-next 20160510 tree hangs within 500 rounds of testing, several
> > times. The same tree with the vfs parallel lookup patchset reverted
> > survived 900 rounds of testing. Reverted commits are attached.
> 
> What hardware?

An HP prototype host.

> 
> > Bisecting in this patchset identified this commit:
> > 
> > 3b0a3c1ac1598722fc289da19219d14f2a37b31f is the first bad commit
> > commit 3b0a3c1ac1598722fc289da19219d14f2a37b31f
> > Author: Al Viro <viro@...iv.linux.org.uk>
> > Date:   Wed Apr 20 23:42:46 2016 -0400
> > 
> >     simple local filesystems: switch to ->iterate_shared()
> >     
> >     no changes needed (XFS isn't simple, but it has the same parallelism
> >     in the interesting parts exercised from CXFS).
> >     
> > With this commit reverted on top of Linux next 0510 tree, 5000+ rounds
> > of testing passed.
> > 
> > Although 2000 rounds of testing were conducted before each good/bad
> > verdict, I'm not 100 percent sure about all this, since it's
> > so hard to hit, and I am not that lucky..
> > 
> > Bisect log and full blocked state process dump log are also attached.
> > 
> > Furthermore, this was first hit when testing fs dax on nvdimm;
> > however, it's reproducible without the dax mount option, and also
> > reproducible on a loop device, where it just seems harder to hit.
> > 
> > Thanks,
> > Xiong
> > 
> > [0.771475] INFO: task cp:49033 blocked for more than 120 seconds.
> > [0.794263]       Not tainted 4.6.0-rc6-next-20160504 #5
> > [0.812515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [0.841801] cp              D ffff880b4e977928     0 49033  49014 0x00000080
> > [0.868923]  ffff880b4e977928 ffff880ba275d380 ffff880b8d712b80 ffff880b4e978000
> > [0.897504]  7fffffffffffffff 0000000000000002 0000000000000000 ffff880b8d712b80
> > [0.925234]  ffff880b4e977940 ffffffff816cbc25 ffff88035a1dabb0 ffff880b4e9779e8
> > [0.953237] Call Trace:
> > [0.958314]  [<ffffffff816cbc25>] schedule+0x35/0x80
> > [0.974854]  [<ffffffff816ceb71>] schedule_timeout+0x231/0x2d0
> > [0.995728]  [<ffffffff810d2e2d>] ? down_trylock+0x2d/0x40
> > [1.015351]  [<ffffffffa03fa6c2>] ? xfs_iext_bno_to_ext+0xa2/0x190 [xfs]
> > [1.040182]  [<ffffffff816cdb89>] __down_common+0xaa/0x104
> > [1.059021]  [<ffffffffa04076d2>] ? _xfs_buf_find+0x162/0x340 [xfs]
> > [1.081357]  [<ffffffff816cdc00>] __down+0x1d/0x1f
> > [1.097166]  [<ffffffff810d2e81>] down+0x41/0x50
> > [1.112869]  [<ffffffffa04074bc>] xfs_buf_lock+0x3c/0xf0 [xfs]
> > [1.134504]  [<ffffffffa04076d2>] _xfs_buf_find+0x162/0x340 [xfs]
> > [1.156871]  [<ffffffffa04078da>] xfs_buf_get_map+0x2a/0x270 [xfs]
> 
> So what's holding that directory data buffer lock? It should only be
> held if there is either IO in progress, or a modification of the
> buffer in progress that is blocked somewhere else.
> 
> > [1.180010]  [<ffffffffa04084ed>] xfs_buf_read_map+0x2d/0x180 [xfs]
> > [1.203538]  [<ffffffffa0435f21>] xfs_trans_read_buf_map+0xf1/0x300 [xfs]
> > [1.229194]  [<ffffffffa03e75b1>] xfs_da_read_buf+0xd1/0x100 [xfs]
> > [1.251948]  [<ffffffffa03ed5e6>] xfs_dir3_data_read+0x26/0x60 [xfs]
> > [1.275736]  [<ffffffffa040923e>] xfs_dir2_leaf_readbuf.isra.12+0x1be/0x4a0 [xfs]
> > [1.305094]  [<ffffffff816cdc92>] ? down_read+0x12/0x30
> > [1.323787]  [<ffffffffa04187a4>] ? xfs_ilock+0xe4/0x110 [xfs]
> > [1.345114]  [<ffffffffa040965b>] xfs_dir2_leaf_getdents+0x13b/0x3d0 [xfs]
> > [1.371818]  [<ffffffffa0409f26>] xfs_readdir+0x1a6/0x1c0 [xfs]
> 
> So we should be holding the ilock in shared mode here...
> 
> > [1.393471]  [<ffffffffa040bcbb>] xfs_file_readdir+0x2b/0x30 [xfs]
> > [1.416874]  [<ffffffff8122cec3>] iterate_dir+0x173/0x190
> > [1.436709]  [<ffffffff81003276>] ? do_audit_syscall_entry+0x66/0x70
> > [1.460951]  [<ffffffff8122d338>] SyS_getdents+0x98/0x120
> > [1.480566]  [<ffffffff8122cee0>] ? iterate_dir+0x190/0x190
> > [1.500909]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110
> > [1.520847]  [<ffffffff816cf9a1>] entry_SYSCALL64_slow_path+0x25/0x25
> > [1.545372] INFO: task cp:49040 blocked for more than 120 seconds.
> > [1.568933]       Not tainted 4.6.0-rc6-next-20160504 #5
> > [1.587943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [1.618544] cp              D ffff880b91463b00     0 49040  49016 0x00000080
> > [1.645502]  ffff880b91463b00 ffff880464d5c140 ffff88029b475700 ffff880b91464000
> > [1.674145]  ffff880411c42610 ffffffffffffffff ffff880411c42628 ffff8802c10bc610
> > [1.702834]  ffff880b91463b18 ffffffff816cbc25 ffff88029b475700 ffff880b91463b88
> > [1.731501] Call Trace:
> > [1.736866]  [<ffffffff816cbc25>] schedule+0x35/0x80
> > [1.754119]  [<ffffffff816ce3c2>] rwsem_down_read_failed+0xf2/0x140
> > [1.777411]  [<ffffffffa0418800>] ? xfs_ilock_data_map_shared+0x30/0x40 [xfs]
> > [1.805090]  [<ffffffff81355e08>] call_rwsem_down_read_failed+0x18/0x30
> > [1.830482]  [<ffffffff816cdca0>] down_read+0x20/0x30
> > [1.848505]  [<ffffffffa04187a4>] xfs_ilock+0xe4/0x110 [xfs]
> > [1.869293]  [<ffffffffa0418800>] xfs_ilock_data_map_shared+0x30/0x40
> 
> And this is an attempt to lock the inode shared, so if that is
> failing while there's another shared holder, that means there's an
> exclusive waiter queued up (i.e. read held -> write blocked -> read
> blocked).
> 
> 
> So looking at dump-g273xfs0510:
> 
> [  845.727907] INFO: task cp:40126 blocked for more than 120 seconds.
> [  845.751175]       Not tainted 4.6.0-rc7-next-20160510 #9
> [  845.770011] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  845.799967] cp              D ffff88046b64bcb8     0 40126  40106 0x00000080
> [  845.826636]  ffff88046b64bcb8 ffffffffa0458d36 ffff8804685c2b80 0000000002408240
> [  845.855232]  ffff88046b64c000 ffff880426b08450 ffff880426b08450 ffff880426b08468
> [  845.883836]  ffff880426b08458 ffff88046b64bcd0 ffffffff816ce9c5 ffff8804685c2b80
> [  845.911854] Call Trace:
> [  845.917076]  [<ffffffffa0458d36>] ? kmem_zone_alloc+0x96/0x120 [xfs]
> [  845.940302]  [<ffffffff816ce9c5>] schedule+0x35/0x80
> [  845.957118]  [<ffffffff816d1534>] rwsem_down_write_failed+0x114/0x230
> [  845.981242]  [<ffffffffa045a294>] ? xlog_grant_head_check+0x54/0x100 [xfs]
> [  846.007673]  [<ffffffffa04492e4>] ? xfs_vn_update_time+0xd4/0x160 [xfs]
> [  846.032009]  [<ffffffff81358217>] call_rwsem_down_write_failed+0x17/0x30
> [  846.056909]  [<ffffffff816d0a7d>] down_write+0x2d/0x40
> [  846.074580]  [<ffffffffa044b78f>] xfs_ilock+0xcf/0x110 [xfs]
> [  846.094681]  [<ffffffffa04492e4>] xfs_vn_update_time+0xd4/0x160 [xfs]
> [  846.118308]  [<ffffffff81238648>] touch_atime+0xa8/0xd0
> [  846.136385]  [<ffffffff8122f132>] iterate_dir+0x112/0x190
> [  846.155995]  [<ffffffff8122f608>] SyS_getdents+0x98/0x120
> [  846.175819]  [<ffffffff8122f1b0>] ? iterate_dir+0x190/0x190
> [  846.195808]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110
> [  846.215019]  [<ffffffff816d2761>] entry_SYSCALL64_slow_path+0x25/0x25
> 
> 
> Timestamp updates are transactional, requiring exclusive access to
> the directory ilock. That's why we have a train-smash of down_read()
> calls blocking in these dumps - they are queued up behind this call.
> This is expected under workloads where a couple of hundred threads
> are all reading the same directory and trying to open files in it.
> 
> So why isn't the current read holder making progress? Why is it
> blocked on a directory data buffer lock when the only accesses
> should be read only?
> 
> There's nothing in either of the stack traces that tells me this. I
> suspect that there may be some kind of buffer readahead vs buffer
> read interaction that results in leaking a buffer lock, but it is
> not obvious what that may be yet.
> 
> Can you reproduce this with CONFIG_XFS_DEBUG=y set? If you can, and
> it doesn't trigger any warnings or asserts, can you then try to
> reproduce it while tracing the following events:
> 
> 	xfs_buf_lock
> 	xfs_buf_lock_done
> 	xfs_buf_trylock
> 	xfs_buf_unlock
> 
> So we might be able to see if there's an unexpected buffer
> locking/state pattern occurring when the hang occurs?

Yes, I've reproduced this with both CONFIG_XFS_DEBUG=y and the tracepoints
on. There is some trace output after the system has hung for a while. Tracing
output, dmesg, blocked-state stack dumps and config are attached. BTW,
reproduced on the mainline tree this time, at commit 0b7962a.

> 
> Also, if you run on slower storage, does the hang get harder or
> easier to hit?

Harder for me.
From easiest to hardest: nvdimm device with the dax mount option, then
nvdimm device without the dax mount option, then loop device.

Thanks,
Xiong

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@...morbit.com

Download attachment "g273-mainline.tar.gz" of type "application/gzip" (139695 bytes)
