[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d95aa962-9750-c27c-639a-2362bdb32f41@cloud.ionos.com>
Date: Wed, 20 Jan 2021 17:33:38 +0100
From: Guoqing Jiang <guoqing.jiang@...ud.ionos.com>
To: Donald Buczek <buczek@...gen.mpg.de>, Song Liu <song@...nel.org>,
linux-raid@...r.kernel.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
it+raid@...gen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
transition
Hi Donald,
On 1/19/21 12:30, Donald Buczek wrote:
> Dear md-raid people,
>
> I've reported a problem in this thread in December:
>
> "We are using raid6 on several servers. Occasionally we had failures,
> where a mdX_raid6 process seems to go into a busy loop and all I/O to
> the md device blocks. We've seen this on various kernel versions." It
> was clear, that this is related to "mdcheck" running, because we found,
> that the command, which should stop the scrubbing in the morning (`echo
> idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
>
> On 12/21/20, I've reported, that the problem might be caused by a
> failure of the underlying block device driver, because I've found
> "inflight" counters of the block devices not being zero. However, this
> is not the case. We were able to run into the mdX_raid6 looping
> condition a few times again, but the non-zero inflight counters have not
> been observed again.
>
> I was able to collect a lot of additional information from a blocked
> system.
>
> - The `cat idle > /sys/devices/virtual/block/md1/md/sync_action` command
> is waiting at kthread_stop to stop the sync thread. [
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>
> - The sync thread ("md1_resync") does not finish, because its blocked at
>
> [<0>] raid5_get_active_stripe+0x4c4/0x660 # [1]
> [<0>] raid5_sync_request+0x364/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>
> - yes, gdb confirms that `conf->cache_state` is 0x03 (
> R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )
The resync thread is blocked since it can't get sh from inactive list,
so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set, that is why `echo idle
> /sys/devices/virtual/block/md1/md/sync_action` can't stop sync thread.
>
> - We have lots of active stripes:
>
> root@...dbird:~/linux_problems/mdX_raid6_looping# cat
> /sys/block/md1/md/stripe_cache_active
> 27534
There are too many active stripes, so this is false:
atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)
so raid5_get_active_stripe has to wait till it becomes true, either
increase max_nr_stripes or decrease active_stripes.
1. Increase max_nr_stripes
since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE
is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could
alloc new stripe in grow_one_stripe and increase max_nr_stripes. So
please check the memory usage of your system.
Another thing is you can try to increase the number of sh manually by
write new number to stripe_cache_size if there is enough memory.
2. Or decrease active_stripes
This is suppose to be done by do_release_stripe, but STRIPE_HANDLE is set
> - handle_stripe() doesn't make progress:
>
> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>
> In dmesg, we see the debug output from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925
> but never from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>
> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
> , check:4, reconstruct:0
> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
> , check:4, reconstruct:0
> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
> , check:4, reconstruct:0
>
> The sector numbers repeat after some time. We have only the following
> variants of stripe state and "check":
>
> state=0x2031 cnt=1, check:0 # ACTIVE +INSYNC+REPLACED+IO_STARTED,
> check_state_idle
> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING +REPLACED+IO_STARTED,
> check_state_check_result
> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING +IO_STARTED,
> check_state_idle
>
> - We have MD_SB_CHANGE_PENDING set:
because MD_SB_CHANGE_PENDING flag. So do_release_stripe can't call
atomic_dec(&conf->active_stripes).
Normally, SB_CHANGE_PENDING is cleared from md_update_sb which could be
called by md_reap_sync_thread. But md_reap_sync_thread stuck with
unregister sync_thread (it was blocked in raid5_get_active_stripe).
Still I don't understand well why mdX_raid6 in a busy loop, maybe raid5d
can't break from the while(1) loop because "if (!batch_size &&
!released)" is false. I would assume released is '0' since
> released_stripes = {
> first = 0x0
> }
And __get_priority_stripe fetches sh from conf->handle_list and delete
it from handle_list, handle_stripe marks the state of the sh with
STRIPE_HANDLE, then do_release_stripe put the sh back to handle_list.
So batch_size can't be '0', and handle_active_stripes in the loop
repeats the process in the busy loop. This is my best guess to explain
the busy loop situation.
>
> root@...dbird:~/linux_problems/mdX_raid6_looping# cat
> /sys/block/md1/md/array_state
> write-pending
>
> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)
since rdev_set_badblocks could set them, could you check if there is
badblock of underlying device (sd*)?
>
> So it can be assumed that handle_stripe breaks out at
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>
> - The system can manually be freed from the deadlock:
>
> When `echo active > /sys/block/md1/md/array_state` is used, the
> scrubbing and other I/O continue. Probably because of
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
Hmm, seems clear SB_CHANGE_PENDING made the trick, so the blocked
process can make progress.
>
> I, of coruse, don't fully understand it yet. Any ideas?
>
> I append some data from a hanging raid... (mddev, r5conf and a sample
> stripe_head from handle_list with it first disks)
These data did help for investigation!
Thanks,
Guoqing
Powered by blists - more mailing lists