linux-kernel - Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <31e7f59e-579a-7812-632d-059ed0a6d441@huaweicloud.com>
Date:   Thu, 23 Mar 2023 09:36:27 +0800
From:   Yu Kuai <yukuai1@...weicloud.com>
To:     Guoqing Jiang <guoqing.jiang@...ux.dev>,
        Yu Kuai <yukuai1@...weicloud.com>, logang@...tatee.com,
        pmenzel@...gen.mpg.de, agk@...hat.com, snitzer@...nel.org,
        song@...nel.org
Cc:     linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
        yi.zhang@...wei.com, yangerkun@...wei.com,
        Marc Smith <msmith626@...il.com>,
        "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap
 sync_thread in action_store"

Hi,

在 2023/03/22 22:32, Guoqing Jiang 写道:
>>> Could you explain how the same work can be re-queued? Isn't the 
>>> PENDING_BIT
>>> is already set in t3? I believe queue_work shouldn't do that per the 
>>> comment
>>> but I am not expert ...
>>
>> This is not related to workqueue, it is just because raid10
>> reinitialize the work that is already queued, 
> 
> I am trying to understand the possibility.
> 
>> like I discribed later in t3:
>>
>> t2:
>> md_check_recovery:
>>  INIT_WORK -> clear pending
>>  queue_work -> set pending
>>   list_add_tail
>> ...
>>
>> t3: -> work is still pending
>> md_check_recovery:
>>  INIT_WORK -> clear pending
>>  queue_work -> set pending
>>   list_add_tail -> list is corrupted
> 
> First, t2 and t3 can't be run in parallel since reconfig_mutex must be 
> held. And if sync_thread existed,
> the second process would unregister and reap sync_thread which means the 
> second process will
> call INIT_WORK and queue_work again.
> 
> Maybe your description is valid, I would prefer call work_pending and 
> flush_workqueue instead of
> INIT_WORK and queue_work.

This is not enough, it's right this can avoid list corruption, but the
worker function md_start_sync just register a sync_thread, and
md_do_sync() can still in progress, hence this can't prevent a new
sync_thread to start while the old one is not done, some other problems
like deadlock can still be triggered.

>> Of course, our 5.10 and mainline are the same,
>>
>> there are some tests:
>>
>> First the deadlock can be reporduced reliably, test script is simple:
>>
>> mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]
> 
> So this is raid10 while the previous problem was appeared in raid456, I 
> am not sure it is the same
> issue, but let's see.

Ok, I'm not quite familiar with raid456 yet, however, the problem is
still related to that action_store hold mutex to unregister sync_thread,
right?

>> Then, the problem MD_RECOVERY_RUNNING can be cleared can't be reporduced
>> reliably, usually it takes 2+ days to triggered a problem, and each time
>> problem phenomenon can be different, I'm hacking the kernel and add
>> some BUG_ON to test MD_RECOVERY_RUNNING in attached patch, following
>> test can trigger the BUG_ON:
> 
> Also your debug patch obviously added large delay which make the 
> calltrace happen, I doubt
> if user can hit it in real life. Anyway, will try below test from my side.
> 
>> mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
>> sleep 5
>> echo 1 > /sys/module/md_mod/parameters/set_delay
>> echo idle > /sys/block/md0/md/sync_action &
>> sleep 5
>> echo "want_replacement" > /sys/block/md0/md/dev-sdd/state
>>
>> test result:
>>
>> [  228.390237] md_check_recovery: running is set
>> [  228.391376] md_check_recovery: queue new sync thread
>> [  233.671041] action_store unregister success! delay 10s
>> [  233.689276] md_check_recovery: running is set
>> [  238.722448] md_check_recovery: running is set
>> [  238.723328] md_check_recovery: queue new sync thread
>> [  238.724851] md_do_sync: before new wor, sleep 10s
>> [  239.725818] md_do_sync: delay done
>> [  243.674828] action_store delay done
>> [  243.700102] md_reap_sync_thread: running is cleared!
>> [  243.748703] ------------[ cut here ]------------
>> [  243.749656] kernel BUG at drivers/md/md.c:9084!
> 
> After your debug patch applied, is L9084 points to below?
> 
> 9084                                 mddev->curr_resync = MaxSector;

In my environment, it's a BUG_ON() that I added in md_do_sync:

9080  skip:
9081         /* set CHANGE_PENDING here since maybe another update is 
needed,
9082         ┊* so other nodes are informed. It should be harmless for 
normal
9083         ┊* raid */
9084         BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
9085         set_mask_bits(&mddev->sb_flags, 0,
9086                 ┊     BIT(MD_SB_CHANGE_PENDING) | 
BIT(MD_SB_CHANGE_DEVS));

> 
> I don't understand how it triggers below calltrace, and it has nothing 
> to do with
> list corruption, right?

Yes, this is just a early BUG_ON() to detect that if MD_RECOVERY_RUNNING
is cleared while sync_thread is still in progress.

Thanks,
Kuai