Message-ID: <20160518095409.GX26977@dastard>
Date: Wed, 18 May 2016 19:54:09 +1000
From: Dave Chinner <david@...morbit.com>
To: Xiong Zhou <xzhou@...hat.com>
Cc: linux-next@...r.kernel.org, viro@...iv.linux.org.uk,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: Linux-next parallel cp workload hang
On Wed, May 18, 2016 at 04:31:50PM +0800, Xiong Zhou wrote:
> Hi,
>
> On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:
> > On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> > > Hi,
> > >
> > > Parallel cp workload (xfstests generic/273) hangs like below.
> > > It's reproducible with a small chance, less than 1/100 I think.
> > >
> > > Have hit this in linux-next 20160504 0506 0510 trees, testing on
> > > xfs with loop or block device. Ext4 survived several rounds
> > > of testing.
> > >
> > > Linux next 20160510 tree hangs within 500 rounds testing several
> > > times. The same tree with vfs parallel lookup patchset reverted
> > > survived 900 rounds testing. Reverted commits are attached.
> >
> > What hardware?
>
> An HP prototype host.
Description? CPUs, memory, etc.? I want to have some idea of what
hardware I need to reproduce this...

xfs_info from the scratch filesystem would also be handy.
> > Can you reproduce this with CONFIG_XFS_DEBUG=y set? If you can, and
> > it doesn't trigger any warnings or asserts, can you then try to
> > reproduce it while tracing the following events:
> >
> > xfs_buf_lock
> > xfs_buf_lock_done
> > xfs_buf_trylock
> > xfs_buf_unlock
> >
> > So we might be able to see if there's an unexpected buffer
> > locking/state pattern occurring when the hang occurs?
>
> Yes, I've reproduced this with both CONFIG_XFS_DEBUG=y and the tracers
> on. There is some trace output after it has hung for a while.
I'm not actually interested in the trace after the hang - I'm
interested in what happened leading up to the hang. The output
you've given me tells me that the directory block at offset is locked
but nothing in the trace tells me what locked it.

Can I suggest using trace-cmd to record the events, then when the
test hangs, kill the check process so that trace-cmd terminates and
gathers the events? Then dump the report to a text file and attach
that.
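
A minimal sketch of that record-then-report sequence, written in Python
only so the command lines can be built and checked before running them.
Several things here are assumptions rather than anything stated in this
thread: the "xfs:" subsystem prefix on the event names, the ./check
generic/273 invocation, and the trace.dat and trace_report.txt file
names. The helper names record_cmd and report_cmd are hypothetical.

```python
import shlex

# Buffer lock events named earlier in this thread.
XFS_BUF_EVENTS = [
    "xfs_buf_lock",
    "xfs_buf_lock_done",
    "xfs_buf_trylock",
    "xfs_buf_unlock",
]


def record_cmd(test_cmd, out="trace.dat", events=XFS_BUF_EVENTS):
    """Build a 'trace-cmd record' argv that traces `events` while test_cmd runs.

    Assumption: the events live in the "xfs" subsystem, hence "xfs:<event>".
    """
    argv = ["trace-cmd", "record", "-o", out]
    for ev in events:
        argv += ["-e", f"xfs:{ev}"]
    # trace-cmd runs the trailing command and stops recording when it exits,
    # which is why killing the hung check process ends the recording.
    return argv + list(test_cmd)


def report_cmd(dat="trace.dat"):
    """Build a 'trace-cmd report' argv that prints the recorded events as text."""
    return ["trace-cmd", "report", "-i", dat]


if __name__ == "__main__":
    # Assumption: the test is started from the xfstests directory like this.
    print(shlex.join(record_cmd(["./check", "generic/273"])))
    # Redirect the report to a text file that can be attached to a reply.
    print(shlex.join(report_cmd()) + " > trace_report.txt")
```

Running the script only prints the two command lines; it does not start
any tracing itself. Paste the first one into a root shell, wait for the
hang, kill the check process, then run the second.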
Cheers,
Dave.
--
Dave Chinner
david@...morbit.com