Message-ID: <93d8d623-8aec-ad91-490c-a414c4926fb2@molgen.mpg.de>
Date: Tue, 26 Jan 2021 17:05:40 +0100
From: Donald Buczek <buczek@...gen.mpg.de>
To: Guoqing Jiang <guoqing.jiang@...ud.ionos.com>,
Song Liu <song@...nel.org>, linux-raid@...r.kernel.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
it+raid@...gen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
transition
Dear Guoqing,
On 26.01.21 15:06, Guoqing Jiang wrote:
>
>
> On 1/26/21 13:58, Donald Buczek wrote:
>>
>>
>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>
>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>> md_check_recovery(mddev);
>>> spin_lock_irq(&conf->device_lock);
>>> }
>>> +
>>> + if ((atomic_read(&conf->active_stripes)
>>> + < (conf->max_nr_stripes * 3 / 4) ||
>>> + (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>> + wake_up(&conf->wait_for_stripe);
>>> }
>>> pr_debug("%d stripes handled\n", handled);
>>
>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>>
>> root@...th:~# cat /proc/$(pgrep md3_resync)/stack
>> [<0>] md_do_sync.cold+0x8ec/0x97c
>> [<0>] md_thread+0xab/0x160
>> [<0>] kthread+0x11b/0x140
>> [<0>] ret_from_fork+0x22/0x30
>>
>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>
> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.
>
> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted.
>
> wait_event(mddev->recovery_wait,
> test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
> !atomic_read(&mddev->recovery_active));
With that added, md3_resync goes into a loop, too. Not 100% busy, though.
root@...th:~# cat /proc/$(pgrep md3_resync)/stack
[<0>] raid5_get_active_stripe+0x1e7/0x6b0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
[<0>] raid5_sync_request+0x2a7/0x3d0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
[<0>] md_do_sync.cold+0x3ee/0x97c # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30
Sometimes the top of the stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, is woken, finds that its wait condition still doesn't hold, and sleeps again.
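To make that guess concrete, here is a small userspace C model of the pattern, not kernel code. The waker condition mirrors the raid5d hunk quoted above. Everything else is an assumption for illustration: the struct model_state, the free_stripe_available field, and the waiter's condition are hypothetical and are not taken from raid5_get_active_stripe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the state seen at one pass of the raid5d loop. */
struct model_state {
    int active_stripes;         /* stands in for atomic_read(&conf->active_stripes) */
    int max_nr_stripes;         /* stands in for conf->max_nr_stripes */
    bool interrupted;           /* stands in for test_bit(MD_RECOVERY_INTR, ...) */
    bool free_stripe_available; /* assumed extra requirement of the waiter */
};

/* Condition under which the proposed hunk calls wake_up(). */
static bool waker_cond(const struct model_state *s)
{
    return s->active_stripes < s->max_nr_stripes * 3 / 4 || s->interrupted;
}

/* Assumed condition the waiter re-checks after each wakeup. It does not
 * look at the interrupted flag, which is the point of the model. */
static bool waiter_cond(const struct model_state *s)
{
    return s->free_stripe_available &&
           s->active_stripes < s->max_nr_stripes * 3 / 4;
}

/* Walk a trace of states and count wakeups after which the waiter had to
 * go back to sleep because its own condition was still false. */
static int count_futile_wakeups(const struct model_state *trace, size_t n)
{
    int futile = 0;

    assert(trace != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!waker_cond(&trace[i]))
            continue;   /* no wake_up(): the waiter stays asleep */
        if (waiter_cond(&trace[i]))
            break;      /* the waiter can proceed */
        futile++;       /* woken, condition false, sleeps again */
    }
    return futile;
}
```

In this model, a trace where every state has the interrupted flag set but no free stripe yields one futile wakeup per loop pass, which would match a thread that keeps cycling without being 100% busy.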
Best
Donald
>
>> And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.
>
> Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.
>
> Anyway, will try the script and investigate more about the issue.
>
> Thanks,
> Guoqing
--
Donald Buczek
buczek@...gen.mpg.de
Tel: +49 30 8413 1433