Message-ID: <80e0f8aa-6d53-3109-37c0-b07c5a4b558c@huaweicloud.com>
Date: Mon, 25 Sep 2023 09:11:12 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Donald Buczek <buczek@...gen.mpg.de>,
Dragan Stancevic <dragan@...ncevic.com>,
Yu Kuai <yukuai1@...weicloud.com>, song@...nel.org
Cc: guoqing.jiang@...ux.dev, it+raid@...gen.mpg.de,
linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
msmith626@...il.com, "yangerkun@...wei.com" <yangerkun@...wei.com>,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
transition
Hi,
On 2023/09/24 22:35, Donald Buczek wrote:
> On 9/17/23 10:55, Donald Buczek wrote:
>> On 9/14/23 08:03, Donald Buczek wrote:
>>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>>> Hi Donald-
>>>> [...]
>>>> Here is a list of changes for 6.1:
>>>>
>>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in
>>>> action_store()
>>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in
>>>> action_store"
>>>
>>> Thanks!
>>>
>>> I've put these patches on v6.1.52. A few hours ago I started a script
>>> which transitions the three md-devices of a very active backup server
>>> through idle->check->idle every 6 minutes. It went through ~400
>>> iterations till now. No lock-ups so far.
>>
>> Oh dear, looks like the deadlock problem is _not_ fixed with these
>> patches.
>
> Some more info after another incident:
>
> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by any
> of the above patches)
> - The symptoms are not exactly the same as with the original year-old
> problem. Differences:
> - - mdX_raid6 is NOT busy looping
> - - /sys/devices/virtual/block/mdX/md/array_state says "active" not
> "write pending"
> - - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does
> not resolve the deadlock
> - - After hours in the deadlock state the system resumed operation when
> a script of mine read(!) lots of sysfs files.
> - But in both cases, `echo idle >
> /sys/devices/virtual/block/mdX/md/sync_action` hangs as does all I/O
> operation on the raid.
>
> The fact that we didn't hit the problem for many months on 5.15.94 might
> hint that it was introduced between 5.15.94 and 5.15.131.
>
> We'll try to reproduce the problem on a test machine for analysis, but
> this may take time (vacation imminent for one...).
>
> But it's not like these patches caused the problem. And maybe they _did_
> fix the original problem, as we didn't hit that one.
Sorry for the late reply. Yes, this looks like a different problem. I'm
pretty confident that the original problem is fixed, since echo
idle/frozen no longer holds the lock 'reconfig_mutex' while waiting for
sync_thread to be done.
I'll check patches between 5.15.94 and 5.15.131.
Thanks,
Kuai
>
> Best
>
> Donald
>
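
For anyone trying to reproduce this on a test machine, the
idle->check->idle cycling described in the thread could be sketched
roughly as below. This is not Donald's actual script: the function names,
the split of each cycle into a "check" phase and an "idle" phase, and the
injectable root/sleep parameters are assumptions made for illustration.
Only the sysfs path layout and the "check"/"idle" values are taken from
the messages above.

```python
import time
from pathlib import Path

# sysfs location as quoted in the thread; can be overridden for dry runs.
SYSFS_ROOT = Path("/sys/devices/virtual/block")


def sync_action_path(md_name, root=SYSFS_ROOT):
    """Return the path of <root>/<md_name>/md/sync_action."""
    return Path(root) / md_name / "md" / "sync_action"


def set_sync_action(md_name, action, root=SYSFS_ROOT):
    """Equivalent of `echo <action> > .../mdX/md/sync_action`.

    In the deadlock reported in the thread, this write is what hangs.
    """
    sync_action_path(md_name, root).write_text(action + "\n")


def get_sync_action(md_name, root=SYSFS_ROOT):
    """Read the current sync_action value, stripped of whitespace."""
    return sync_action_path(md_name, root).read_text().strip()


def cycle_check_idle(md_names, iterations, check_seconds, idle_seconds,
                     root=SYSFS_ROOT, sleep=time.sleep):
    """Repeatedly start a check on every array, then force it back to idle.

    The phase lengths are assumptions; the thread only says that one full
    idle->check->idle transition happened every 6 minutes.
    """
    for i in range(iterations):
        for md in md_names:
            set_sync_action(md, "check", root)
        sleep(check_seconds)
        for md in md_names:
            set_sync_action(md, "idle", root)
        sleep(idle_seconds)
        print(f"iteration {i + 1} done")
```

A hypothetical invocation for three arrays with a 6-minute period would be
cycle_check_idle(["md0", "md1", "md2"], 400, 300, 60), run as root. Since
a hung write to sync_action blocks the caller, a stalled iteration counter
is itself the sign that the deadlock was hit.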