Date:   Mon, 25 Sep 2023 09:11:12 +0800
From:   Yu Kuai <yukuai1@...weicloud.com>
To:     Donald Buczek <buczek@...gen.mpg.de>,
        Dragan Stancevic <dragan@...ncevic.com>,
        Yu Kuai <yukuai1@...weicloud.com>, song@...nel.org
Cc:     guoqing.jiang@...ux.dev, it+raid@...gen.mpg.de,
        linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
        msmith626@...il.com, "yangerkun@...wei.com" <yangerkun@...wei.com>,
        "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
 transition

Hi,

On 2023/09/24 22:35, Donald Buczek wrote:
> On 9/17/23 10:55, Donald Buczek wrote:
>> On 9/14/23 08:03, Donald Buczek wrote:
>>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>>> Hi Donald-
>>>> [...]
>>>> Here is a list of changes for 6.1:
>>>>
>>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in 
>>>> action_store()
>>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in 
>>>> action_store"
>>>
>>> Thanks!
>>>
>>> I've put these patches on v6.1.52. I've started a script which 
>>> transitions the three md-devices of a very active backup server 
>>> through idle->check->idle every 6 minutes a few hours ago.  It went 
>>> through ~400 iterations till now. No lock-ups so far.
>>
>> Oh dear, looks like the deadlock problem is _not_ fixed with these 
>> patches.
> 
> Some more info after another incident:
> 
> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by any 
> of the above patches)
> - The symptoms are not exactly the same as with the original year-old 
> problem. Differences:
> - - mdX_raid6 is NOT busy looping
> - - /sys/devices/virtual/block/mdX/md/array_state says "active" not 
> "write pending"
> - - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does 
> not resolve the deadlock
> - - After hours in the deadlock state the system resumed operation when 
> a script of mine read(!) lots of sysfs files.
> - But in both cases, `echo idle > 
> /sys/devices/virtual/block/mdX/md/sync_action` hangs, as do all I/O 
> operations on the raid.
> 
> The fact that we didn't hit the problem for many months on 5.15.94 might 
> hint that it was introduced between 5.15.94 and 5.15.131.
> 
> We'll try to reproduce the problem on a test machine for analysis, but 
> this may take time (vacation imminent for one...).
> 
> But it's not like these patches caused the problem. And maybe they _did_ 
> fix the original problem, as we didn't hit that one.

Sorry for the late reply. Yes, this looks like a different problem. I'm
pretty confident that the original problem is fixed, since echo
idle/frozen doesn't hold the lock 'reconfig_mutex' while waiting for
sync_thread to be done.

I'll check patches between 5.15.94 and 5.15.131.

Thanks,
Kuai

> 
> Best
> 
>    Donald
> 
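
For anyone trying to reproduce this on a test machine, the idle->check->idle cycling described in the quoted text could be sketched roughly as below. This is not Donald's actual script, which was not posted. The function name `cycle_sync_action`, its parameters, and the choice to sleep for the same interval after each write are assumptions made for illustration. Only the sysfs `sync_action` file and the values "check" and "idle" come from the thread.

```python
import time
from pathlib import Path


def cycle_sync_action(md_dir, iterations, interval_s=360, sleep=time.sleep):
    """Cycle one md array through check -> idle `iterations` times.

    md_dir is the array's sysfs "md" directory, for example
    /sys/devices/virtual/block/mdX/md (mdX is a placeholder).
    Returns the list of values written, in order.
    """
    sync_action = Path(md_dir) / "sync_action"
    written = []
    for _ in range(iterations):
        for action in ("check", "idle"):
            # In the reports above, the write of "idle" is the step that hangs.
            sync_action.write_text(action + "\n")
            written.append(action)
            # Assumed timing: wait the same interval after every write.
            sleep(interval_s)
    return written


# Hypothetical usage, as root, on a test machine only:
# cycle_sync_action("/sys/devices/virtual/block/mdX/md", iterations=400)
```

If the write of "idle" never returns, the loop stops at that point, so logging a timestamp around each write would help pin down when the hang started.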
