linux-kernel - Re: Linux-next parallel cp workload hang

Message-ID: <20160518114617.GC6551@dhcp12-144.nay.redhat.com>
Date:	Wed, 18 May 2016 19:46:17 +0800
From:	Xiong Zhou <xzhou@...hat.com>
To:	Dave Chinner <david@...morbit.com>
Cc:	Xiong Zhou <xzhou@...hat.com>, linux-next@...r.kernel.org,
	viro@...iv.linux.org.uk, linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org
Subject: Re: Linux-next parallel cp workload hang

Hi,

On Wed, May 18, 2016 at 07:54:09PM +1000, Dave Chinner wrote:
> On Wed, May 18, 2016 at 04:31:50PM +0800, Xiong Zhou wrote:
> > Hi,
> > 
> > On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:
> > > On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> > > > Hi,
> > > > 
> > > > Parallel cp workload (xfstests generic/273) hangs like below.
> > > > It's reproducible with a small chance, less than 1/100 I think.
> > > > 
> > > > Have hit this in linux-next 20160504 0506 0510 trees, testing on
> > > > xfs with loop or block device. Ext4 survived several rounds
> > > > of testing.
> > > > 
> > > > Linux next 20160510 tree hangs within 500 rounds testing several
> > > > times. The same tree with vfs parallel lookup patchset reverted
> > > > survived 900 rounds testing. Reverted commits are attached.
> > > 
> > > What hardware?
> > 
> > A HP prototype host.
> 
> description? cpus, memory, etc? I want to have some idea of what
> hardware I need to reproduce this...

#lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping:              2
CPU MHz:               2596.918
BogoMIPS:              5208.33
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

#free -m
        total        used        free      shared  buff/cache   available
Mem:    31782         623       27907           9        3251       30491
Swap:   10239           0       10239

> 
> xfs_info from the scratch filesystem would also be handy.

meta-data=/dev/pmem1             isize=256    agcount=4, agsize=131072 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=524288, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
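
As a quick sanity check on the geometry above, the data and log sizes can be derived from the xfs_info fields. This is only a sketch in shell arithmetic; the variable names are mine, and the values are copied from the output above.

```shell
#!/bin/sh
# Values copied from the xfs_info output above.
agcount=4        # number of allocation groups
agsize=131072    # blocks per allocation group
bsize=4096       # filesystem block size in bytes
logblocks=2560   # internal log size in blocks

# Total data blocks should match the "blocks=" field of the data section.
data_blocks=$((agcount * agsize))
# Convert block counts to MiB.
data_mib=$((data_blocks * bsize / 1024 / 1024))
log_mib=$((logblocks * bsize / 1024 / 1024))

echo "data blocks: $data_blocks"
echo "data size:   $data_mib MiB"
echo "log size:    $log_mib MiB"
```

So the scratch filesystem on /dev/pmem1 is small: 4 allocation groups covering 524288 blocks in total, with an internal log of 2560 blocks.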
> 
> > > Can you reproduce this with CONFIG_XFS_DEBUG=y set? if you can, and
> > > it doesn't trigger any warnings or asserts, can you then try to
> > > reproduce it while tracing the following events:
> > > 
> > > 	xfs_buf_lock
> > > 	xfs_buf_lock_done
> > > 	xfs_buf_trylock
> > > 	xfs_buf_unlock
> > > 
> > > So we might be able to see if there's an unexpected buffer
> > > locking/state pattern occurring when the hang occurs?
> > 
> > Yes, I've reproduced this with both CONFIG_XFS_DEBUG=y and the tracers
> > on. There is some trace output after it has hung for a while.
> 
> I'm not actually interested in the trace after the hang - I'm
> interested in what happened leading up to the hang. The output
> you've given me tells me that the directory block at offset is locked
> but nothing in the trace tells me what locked it.
> 
> Can I suggest using trace-cmd to record the events, then when the
> test hangs kill the check process so that trace-cmd terminates and
> gathers the events. Then dump the report to a text file and attach
> that?

Sure. Trace report, dmesg, ps axjf after Ctrl+C are attached.
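
For anyone else trying to reproduce this, the capture procedure suggested above can be sketched roughly as follows. It is a dry-run helper that only prints the commands rather than running them. The output file names (trace.dat, g273-trace-report.txt), the xfs: subsystem prefix on the event names, and the use of ./check generic/273 as the workload are my assumptions, not something taken from the thread.

```shell
#!/bin/sh
# The four buffer-lock tracepoints requested earlier in the thread.
events="xfs_buf_lock xfs_buf_lock_done xfs_buf_trylock xfs_buf_unlock"

# Hypothetical helper: print a "trace-cmd record" command line that enables
# the events above and runs the given workload under tracing.
build_record_cmd() {
    cmd="trace-cmd record -o trace.dat"
    for ev in $events; do
        cmd="$cmd -e xfs:$ev"
    done
    echo "$cmd $*"
}

# Hypothetical helper: print the command that dumps the recorded events
# to a text file once trace-cmd has terminated.
build_report_cmd() {
    echo "trace-cmd report -i trace.dat > g273-trace-report.txt"
}

# Dry run: show what would be executed.
build_record_cmd ./check generic/273
build_report_cmd
```

When the test hangs, killing the check process lets trace-cmd terminate and gather the events, and the report step then produces the text file to attach.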

Thanks for the instructions and patience.
Xiong
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@...morbit.com

Download attachment "g273-trace-report.tar.gz" of type "application/gzip" (244506 bytes)