Message-ID: <35130b3f-c0fd-e2d6-e849-a5ceb6a2895f@linux.dev>
Date: Tue, 22 Aug 2023 20:41:15 +0800
From: Guoqing Jiang <guoqing.jiang@...ux.dev>
To: AceLan Kao <acelan@...il.com>
Cc: Song Liu <song@...nel.org>,
Mariusz Tkaczyk <mariusz.tkaczyk@...ux.intel.com>,
Bagas Sanjaya <bagasdotme@...il.com>,
Christoph Hellwig <hch@....de>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux Regressions <regressions@...ts.linux.dev>,
Linux RAID <linux-raid@...r.kernel.org>
Subject: Re: Infinite systemd loop when powering off the machine with multiple
MD RAIDs
Hi AceLan,
On 8/22/23 16:13, AceLan Kao wrote:
>>>>> Hello,
>>>>> The issue is reproducible with IMSM metadata too; around 20% of reboots
>>>>> hang. I will try to raise the priority in the bug because it is a valid
>>>>> high: the base functionality of the system is affected.
>>>> Since it is reproducible from your side, is it possible to turn the
>>>> reproduction steps into a test case,
>>>> given the importance?
>> I didn't try to reproduce it locally yet because the customer was able to
>> bisect the regression and it pointed them to the same patch, so I connected
>> the two and asked the author to take a look first. At first glance, I wanted
>> to get the community's view on whether it is something obvious.
>>
>> As far as I know, the customer creates 3 IMSM RAID arrays, one of which is
>> the system volume, then reboots, and it sporadically fails (around 20%).
>> That is all.
>>
>>>> I guess if all arrays have the MD_DELETED flag set, then reboot might
>>>> hang. Not sure whether the change below helps or not (we may also need
>>>> to flush the wq before the list_del), just FYI.
>>>>
>>>> @@ -9566,8 +9566,10 @@ static int md_notify_reboot(struct notifier_block *this,
>>>>
>>>> spin_lock(&all_mddevs_lock);
>>>> list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
>>>> - if (!mddev_get(mddev))
>>>> + if (!mddev_get(mddev)) {
>>>> + list_del(&mddev->all_mddevs);
>>>> continue;
>>>> + }
My suggestion is to delete the list node in this scenario; did you try the
above?
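
To make it concrete, with that change applied the loop would read roughly as
below. Everything outside the two added lines is only reconstructed from my
memory of mainline md.c, so please treat it as a sketch rather than an exact
copy of any tree:

static int md_notify_reboot(struct notifier_block *this,
                            unsigned long code, void *x)
{
        struct mddev *mddev, *n;
        int need_delay = 0;

        spin_lock(&all_mddevs_lock);
        list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
                if (!mddev_get(mddev)) {
                        /*
                         * The array is already marked for deletion, so drop
                         * it from the list instead of iterating over it again.
                         * As noted above, flushing the workqueue that runs the
                         * delayed delete might also be needed before this.
                         */
                        list_del(&mddev->all_mddevs);
                        continue;
                }
                spin_unlock(&all_mddevs_lock);
                if (mddev_trylock(mddev)) {
                        if (mddev->pers)
                                __md_stop_writes(mddev);
                        if (mddev->persistent)
                                mddev->safemode = 2;
                        mddev_unlock(mddev);
                }
                need_delay = 1;
                mddev_put(mddev);
                spin_lock(&all_mddevs_lock);
        }
        spin_unlock(&all_mddevs_lock);

        if (need_delay)
                mdelay(1000);

        return NOTIFY_DONE;
}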
>>> I am still not able to reproduce this, probably due to differences in the
>>> timing. Maybe we only need something like:
>>>
>>> diff --git i/drivers/md/md.c w/drivers/md/md.c
>>> index 5c3c19b8d509..ebb529b0faf8 100644
>>> --- i/drivers/md/md.c
>>> +++ w/drivers/md/md.c
>>> @@ -9619,8 +9619,10 @@ static int md_notify_reboot(struct notifier_block *this,
>>>
>>> spin_lock(&all_mddevs_lock);
>>> list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
>>> - if (!mddev_get(mddev))
>>> + if (!mddev_get(mddev)) {
>>> + need_delay = 1;
>>> continue;
>>> + }
>>> spin_unlock(&all_mddevs_lock);
>>> if (mddev_trylock(mddev)) {
>>> if (mddev->pers)
>>>
>>>
>>> Thanks,
>>> Song
>> I will try to reproduce the issue in the Intel lab to check this.
>>
>> Thanks,
>> Mariusz
> Hi Guoqing,
>
> Here is the command with which I trigger the issue; I have to run it around
> 10 times to make sure the issue reproduces.
>
> echo "repair" | sudo tee /sys/class/block/md12?/md/sync_action && sudo
> grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 6.5.0-rc77
> 06a74159504-dirty" && head -c 1G < /dev/urandom > myfile1 && sleep 180
> && head -c 1G < /dev/urandom > myfile2 && sleep 1 && cat /proc/mdstat
> && sleep 1 && rm myfile1 &&
> sudo reboot
Is the issue still reproducible if you remove the line below from the command?

echo "repair" | sudo tee /sys/class/block/md12?/md/sync_action

Just want to know whether the resync thread is related to the issue or not.
> And the patch to add need_delay doesn't work.
My assumption is that mddev_get always returns NULL here, so setting
need_delay wouldn't help.
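
For reference, the reason I think so: mddev_get() gives up as soon as
MD_DELETED is set, and need_delay only adds a one-second delay at the very
end of the notifier. The snippet below is a rough sketch from memory, not a
verbatim copy of any tree:

static struct mddev *mddev_get(struct mddev *mddev)
{
        lockdep_assert_held(&all_mddevs_lock);

        /*
         * Once the array is marked MD_DELETED we refuse to take a new
         * reference, so every call from the reboot notifier returns NULL.
         */
        if (test_bit(MD_DELETED, &mddev->flags))
                return NULL;
        atomic_inc(&mddev->active);
        return mddev;
}

        /* ... and at the end of md_notify_reboot(): */
        if (need_delay)
                mdelay(1000);   /* only delays the reboot by one second */

        return NOTIFY_DONE;

If that reading is correct, the extra delay does not remove the dying arrays
from all_mddevs, which is why I suggested the list_del above.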
Thanks,
Guoqing