Message-ID: <3fc2a539-e4cc-e057-6cf0-da7b3953be6e@linux.dev>
Date: Thu, 23 Mar 2023 11:50:48 +0800
From: Guoqing Jiang <guoqing.jiang@...ux.dev>
To: Yu Kuai <yukuai1@...weicloud.com>, logang@...tatee.com,
pmenzel@...gen.mpg.de, agk@...hat.com, snitzer@...nel.org,
song@...nel.org
Cc: linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com,
Marc Smith <msmith626@...il.com>,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap
sync_thread in action_store"
On 3/23/23 09:36, Yu Kuai wrote:
> Hi,
>
> On 2023/03/22 22:32, Guoqing Jiang wrote:
>>>> Could you explain how the same work can be re-queued? Isn't the
>>>> PENDING_BIT already set in t3? I believe queue_work shouldn't do
>>>> that per the comment, but I am not an expert ...
>>>
>>> This is not related to workqueue, it is just because raid10
>>> reinitializes a work that is already queued,
>>
>> I am trying to understand the possibility.
>>
>>> as I described later in t3:
>>>
>>> t2:
>>> md_check_recovery:
>>> INIT_WORK -> clear pending
>>> queue_work -> set pending
>>> list_add_tail
>>> ...
>>>
>>> t3: -> work is still pending
>>> md_check_recovery:
>>> INIT_WORK -> clear pending
>>> queue_work -> set pending
>>> list_add_tail -> list is corrupted
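(For reference, a minimal sketch of the sequence in the trace above,
with the surrounding md_check_recovery logic stripped away; this is
only an illustration of why re-initializing a still-pending work item
corrupts the workqueue list:)

	/*
	 * t2: normal path.  INIT_WORK() clears the PENDING bit and
	 * re-initializes work->entry; queue_work() sets PENDING and
	 * does list_add_tail() of work->entry onto the workqueue's
	 * pending list.
	 */
	INIT_WORK(&mddev->del_work, md_start_sync);
	queue_work(md_misc_wq, &mddev->del_work);

	/*
	 * t3: the worker has not run yet, so work->entry is still
	 * linked.  INIT_WORK() clears PENDING and resets work->entry
	 * anyway; the following queue_work() sees PENDING clear and
	 * list_add_tail()s an entry that is already on the list ->
	 * list corruption.
	 */
	INIT_WORK(&mddev->del_work, md_start_sync);
	queue_work(md_misc_wq, &mddev->del_work);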
>>
>> First, t2 and t3 can't run in parallel since reconfig_mutex must
>> be held. And if sync_thread exists, the second process would
>> unregister and reap sync_thread, which means the second process
>> will call INIT_WORK and queue_work again.
>>
>> Maybe your description is valid; I would prefer to call
>> work_pending and flush_workqueue instead of INIT_WORK and
>> queue_work.
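Roughly like this (just an untested sketch of what I mean, assuming
md_misc_wq is the queue that del_work is queued on):

	/*
	 * Sketch only: wait for a still-pending del_work to finish
	 * instead of re-initializing it, so INIT_WORK() never touches
	 * a work item that is still linked into the workqueue list.
	 */
	if (work_pending(&mddev->del_work))
		flush_workqueue(md_misc_wq);
	INIT_WORK(&mddev->del_work, md_start_sync);
	queue_work(md_misc_wq, &mddev->del_work);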
>
> This is not enough. It's true that this can avoid the list
> corruption, but the worker function md_start_sync just registers a
> sync_thread, and md_do_sync() can still be in progress; hence this
> can't prevent a new sync_thread from starting while the old one is
> not done, and other problems like deadlock can still be triggered.
>
>>> Of course, our 5.10 and mainline are the same,
>>>
>>> there are some tests:
>>>
>>> First, the deadlock can be reproduced reliably; the test script is simple:
>>>
>>> mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]
>>
>> So this is raid10 while the previous problem appeared in raid456;
>> I am not sure it is the same issue, but let's see.
>
> Ok, I'm not quite familiar with raid456 yet; however, the problem is
> still related to action_store holding the mutex to unregister
> sync_thread, right?
Yes and no, the previous raid456 bug also existed because it can't get
a stripe while the barrier is involved, as you mentioned in patch 4,
which is different.
>
>>> Then, the problem that MD_RECOVERY_RUNNING can be cleared can't be
>>> reproduced reliably; usually it takes 2+ days to trigger a problem,
>>> and each time the problem phenomenon can be different. I'm hacking
>>> the kernel and adding some BUG_ONs to test MD_RECOVERY_RUNNING in
>>> the attached patch; the following test can trigger the BUG_ON:
>>
>> Also, your debug patch obviously added a large delay which makes the
>> calltrace happen; I doubt a user can hit it in real life. Anyway, I
>> will try the below test from my side.
>>
>>> mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
>>> sleep 5
>>> echo 1 > /sys/module/md_mod/parameters/set_delay
>>> echo idle > /sys/block/md0/md/sync_action &
>>> sleep 5
>>> echo "want_replacement" > /sys/block/md0/md/dev-sdd/state
Combining your debug patch with the above steps, it seems you:
1. add a delay to action_store, so it can't get the lock in time.
2. echo "want_replacement" triggers md_check_recovery, which can grab
the lock to start a sync thread.
3. action_store finally holds the lock to clear RECOVERY_RUNNING in
reap sync thread.
4. Then the newly added BUG_ON is invoked since RECOVERY_RUNNING was
cleared in step 3.
>>>
>>> test result:
>>>
>>> [ 228.390237] md_check_recovery: running is set
>>> [ 228.391376] md_check_recovery: queue new sync thread
>>> [ 233.671041] action_store unregister success! delay 10s
>>> [ 233.689276] md_check_recovery: running is set
>>> [ 238.722448] md_check_recovery: running is set
>>> [ 238.723328] md_check_recovery: queue new sync thread
>>> [ 238.724851] md_do_sync: before new wor, sleep 10s
>>> [ 239.725818] md_do_sync: delay done
>>> [ 243.674828] action_store delay done
>>> [ 243.700102] md_reap_sync_thread: running is cleared!
>>> [ 243.748703] ------------[ cut here ]------------
>>> [ 243.749656] kernel BUG at drivers/md/md.c:9084!
>>
>> After your debug patch is applied, does L9084 point to below?
>>
>> 9084 mddev->curr_resync = MaxSector;
>
> In my environment, it's a BUG_ON() that I added in md_do_sync:
Ok, so we are on different code bases ...
> 9080 skip:
> 9081         /* set CHANGE_PENDING here since maybe another update is needed,
> 9082          * so other nodes are informed. It should be harmless for normal
> 9083          * raid */
> 9084         BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> 9085         set_mask_bits(&mddev->sb_flags, 0,
> 9086                       BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
>
>>
>> I don't understand how it triggers the below calltrace, and it has
>> nothing to do with the list corruption, right?
>
> Yes, this is just an early BUG_ON() to detect if MD_RECOVERY_RUNNING
> is cleared while sync_thread is still in progress.
sync_thread can be interrupted once MD_RECOVERY_INTR is set, which
means RUNNING can be cleared, so I am not sure the added BUG_ON is
reasonable. Changing the BUG_ON like this makes more sense to me:
+	BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+	       !test_bit(MD_RECOVERY_INTR, &mddev->recovery));
I think there might be a racy window like you described, but it should
be really small. I prefer to just add a few lines like this instead of
reverting and introducing a new lock to resolve the same issue (if it
is the same issue).
@@ -4792,9 +4793,15 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 	if (mddev->sync_thread) {
 		sector_t save_rp = mddev->reshape_position;
+		set_bit(MD_RECOVERY_DONOT, &mddev->recovery);

@@ -4805,6 +4812,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 		mddev->reshape_position = save_rp;
 		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 		md_reap_sync_thread(mddev);
+		clear_bit(MD_RECOVERY_DONOT, &mddev->recovery);
 	}
 	mddev_unlock(mddev);

@@ -9296,6 +9313,9 @@ void md_check_recovery(struct mddev *mddev)
 	if (!md_is_rdwr(mddev) &&
 	    !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
 		return;
+	/* action_store is in the middle of reap sync thread, let's wait */
+	if (test_bit(MD_RECOVERY_DONOT, &mddev->recovery))
+		return;

--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -553,6 +553,7 @@ enum recovery_flags {
 	MD_RECOVERY_ERROR,	/* sync-action interrupted because io-error */
 	MD_RECOVERY_WAIT,	/* waiting for pers->start() to finish */
 	MD_RESYNCING_REMOTE,	/* remote node is running resync thread */
+	MD_RECOVERY_DONOT,	/* for a nasty racy issue */
 };
TBH, I am reluctant to see the changes in this series; they can only
be considered acceptable under two conditions:
1. the previous raid456 bug can be fixed in this way too; hopefully
Marc or others can verify it.
2. it passes all the tests in mdadm.
Thanks,
Guoqing