[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <066cb9e2-f583-b2c7-f42c-861568d38e2f@molgen.mpg.de>
Date: Sun, 27 Dec 2020 18:34:24 +0100
From: Donald Buczek <buczek@...gen.mpg.de>
To: Brian Foster <bfoster@...hat.com>
Cc: linux-xfs@...r.kernel.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
it+linux-xfs@...gen.mpg.de
Subject: Re: v5.10.1 xfs deadlock
On 18.12.20 19:35, Donald Buczek wrote:
> On 18.12.20 16:35, Brian Foster wrote:
>> On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
>>> On 17.12.20 20:43, Brian Foster wrote:
>>>> On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
>>>>> Dear xfs developer,
>>>>>
>>>>> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
>>>>>
>>>>> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
>>>>>
>>>>> After about on hour, the processes to one filesystem (md1) blocked, 30 minutes later the process to the other filesystem (md0) did.
>>>>>
>>>>> root 7322 2167 0 Dec16 pts/1 00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>>>>> root 7329 2169 0 Dec16 pts/1 00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>>>>> root 13856 2170 0 Dec16 pts/1 00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>>>>> root 13899 2168 0 Dec16 pts/1 00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
>>>>>
>>
>> Do you have any indication of whether these workloads actually hung or
>> just became incredibly slow?
>
> There is zero progress. iostat doesn't show any I/O on any of the block devices (md or member)
>
>>>>> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
>>>>>
>>>>> It stands out, that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads all with stacks like this
>>>>>
>>>>> [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>>>>> [<0>] __xfs_trans_commit+0xab/0x320
>>>>> [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>>>>> [<0>] xfs_end_ioend+0xc6/0x110
>>>>> [<0>] xfs_end_io+0xad/0xe0
>>>>> [<0>] process_one_work+0x1dd/0x3e0
>>>>> [<0>] worker_thread+0x2d/0x3b0
>>>>> [<0>] kthread+0x118/0x130
>>>>> [<0>] ret_from_fork+0x22/0x30
>>>>>
>>>>> xfs_log_commit_cil+0x6cc is
>>>>>
>>>>> xfs_log_commit_cil()
>>>>> xlog_cil_push_background(log)
>>>>> xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>>>>>
>>
>> This looks like the transaction commit throttling code. That was
>> introduced earlier this year in v5.7 via commit 0e7ab7efe7745 ("xfs:
>> Throttle commits on delayed background CIL push"). The purpose of that
>> change was to prevent the CIL from growing too large. FWIW, I don't
>> recall that being a functional problem so it should be possible to
>> simply remove that blocking point and see if that avoids the problem or
>> if we simply stall out somewhere else, if you wanted to give that a
>> test.
>
> Will do. Before trying with this commit reverted, I will repeat the test without any change to see if the problem is reproducible at all.
I'm now able to reliably reproduce the deadlock with a little less complex setup (e.g. with only one filesystem involved). One key to that was to run the test against a freshly created filesystem (mkfs).
And, yes, you are right: When I revert ef565ab8cc2e ("xfs: Throttle commits on delayed background CIL push") and 7ee6dfa2a245 ("xfs: fix use-after-free on CIL context on shutdown") the deadlock seems to be gone.
Best
Donald
Powered by blists - more mailing lists