Message-ID: <31e7f59e-579a-7812-632d-059ed0a6d441@huaweicloud.com>
Date: Thu, 23 Mar 2023 09:36:27 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Guoqing Jiang <guoqing.jiang@...ux.dev>,
Yu Kuai <yukuai1@...weicloud.com>, logang@...tatee.com,
pmenzel@...gen.mpg.de, agk@...hat.com, snitzer@...nel.org,
song@...nel.org
Cc: linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com,
Marc Smith <msmith626@...il.com>,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap
sync_thread in action_store"
Hi,
On 2023/03/22 22:32, Guoqing Jiang wrote:
>>> Could you explain how the same work can be re-queued? Isn't the
>>> PENDING_BIT already set in t3? I believe queue_work shouldn't do
>>> that per the comment, but I am not an expert ...
>>
>> This is not related to workqueue; it is just because raid10
>> reinitializes the work that is already queued,
>
> I am trying to understand the possibility.
>
>> like I described later in t3:
>>
>> t2:
>> md_check_recovery:
>> INIT_WORK -> clear pending
>> queue_work -> set pending
>> list_add_tail
>> ...
>>
>> t3: -> work is still pending
>> md_check_recovery:
>> INIT_WORK -> clear pending
>> queue_work -> set pending
>> list_add_tail -> list is corrupted
>
> First, t2 and t3 can't run in parallel since reconfig_mutex must be
> held. And if a sync_thread existed, the second process would
> unregister and reap the sync_thread, which means the second process
> will call INIT_WORK and queue_work again.
>
> Maybe your description is valid; I would prefer to call work_pending
> and flush_workqueue instead of INIT_WORK and queue_work.
This is not enough. It's true that this can avoid the list corruption,
but the worker function md_start_sync just registers a sync_thread, and
md_do_sync() can still be in progress; hence this can't prevent a new
sync_thread from starting while the old one is not done, and some other
problems like deadlock can still be triggered.
>> Of course, our 5.10 and mainline are the same,
>>
>> there are some tests:
>>
>> First, the deadlock can be reproduced reliably; the test script is simple:
>>
>> mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]
>
> So this is raid10 while the previous problem appeared in raid456; I
> am not sure it is the same issue, but let's see.
Ok, I'm not quite familiar with raid456 yet; however, the problem is
still related to the fact that action_store holds the mutex to
unregister the sync_thread, right?
>> Then, the problem that MD_RECOVERY_RUNNING can be cleared can't be
>> reproduced reliably; usually it takes 2+ days to trigger, and each
>> time the symptoms can be different. I'm hacking the kernel and added
>> some BUG_ON to test MD_RECOVERY_RUNNING in the attached patch; the
>> following test can trigger the BUG_ON:
>
> Also, your debug patch obviously added a large delay which makes the
> calltrace happen; I doubt a user can hit it in real life. Anyway,
> will try the below test from my side.
>
>> mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
>> sleep 5
>> echo 1 > /sys/module/md_mod/parameters/set_delay
>> echo idle > /sys/block/md0/md/sync_action &
>> sleep 5
>> echo "want_replacement" > /sys/block/md0/md/dev-sdd/state
>>
>> test result:
>>
>> [ 228.390237] md_check_recovery: running is set
>> [ 228.391376] md_check_recovery: queue new sync thread
>> [ 233.671041] action_store unregister success! delay 10s
>> [ 233.689276] md_check_recovery: running is set
>> [ 238.722448] md_check_recovery: running is set
>> [ 238.723328] md_check_recovery: queue new sync thread
>> [ 238.724851] md_do_sync: before new wor, sleep 10s
>> [ 239.725818] md_do_sync: delay done
>> [ 243.674828] action_store delay done
>> [ 243.700102] md_reap_sync_thread: running is cleared!
>> [ 243.748703] ------------[ cut here ]------------
>> [ 243.749656] kernel BUG at drivers/md/md.c:9084!
>
> After your debug patch is applied, does L9084 point to below?
>
> 9084 mddev->curr_resync = MaxSector;
In my environment, it's a BUG_ON() that I added in md_do_sync():

9080  skip:
9081         /* set CHANGE_PENDING here since maybe another update is needed,
9082          * so other nodes are informed. It should be harmless for normal
9083          * raid */
9084         BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
9085         set_mask_bits(&mddev->sb_flags, 0,
9086                       BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
>
> I don't understand how it triggers the below calltrace, and it has
> nothing to do with the list corruption, right?
Yes, this is just an early BUG_ON() to detect whether
MD_RECOVERY_RUNNING is cleared while the sync_thread is still in
progress.
Thanks,
Kuai