Message-ID: <513ea05e-cc2e-e0c8-43cc-6636b0631cdf@huaweicloud.com>
Date: Thu, 7 Sep 2023 20:53:30 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Mariusz Tkaczyk <mariusz.tkaczyk@...ux.intel.com>,
Yu Kuai <yukuai1@...weicloud.com>
Cc: AceLan Kao <acelan@...il.com>, Song Liu <song@...nel.org>,
Guoqing Jiang <guoqing.jiang@...ux.dev>,
Bagas Sanjaya <bagasdotme@...il.com>,
Christoph Hellwig <hch@....de>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux Regressions <regressions@...ts.linux.dev>,
Linux RAID <linux-raid@...r.kernel.org>,
"yangerkun@...wei.com" <yangerkun@...wei.com>,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: Infinite systemd loop when powering off the machine with multiple
MD RAIDs
Hi,
On 2023/09/07 20:41, Mariusz Tkaczyk wrote:
> On Thu, 7 Sep 2023 20:14:03 +0800
> Yu Kuai <yukuai1@...weicloud.com> wrote:
>
>> Hi,
>>
>> On 2023/09/07 19:26, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2023/09/07 18:18, Mariusz Tkaczyk wrote:
>>>> On Thu, 7 Sep 2023 10:04:11 +0800
>>>> Yu Kuai <yukuai1@...weicloud.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 2023/09/06 18:27, Mariusz Tkaczyk wrote:
>>>>>> On Wed, 6 Sep 2023 14:26:30 +0800
>>>>>> AceLan Kao <acelan@...il.com> wrote:
>>>>>>> From previous testing, I don't think it's an issue in systemd, so I
>>>>>>> did a simple test and found that the issue goes away.
>>>>>>> You only need to add a small delay in md_release(), and then the issue
>>>>>>> can't be reproduced.
>>>>>>>
>>>>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>>>>> index 78be7811a89f..ef47e34c1af5 100644
>>>>>>> --- a/drivers/md/md.c
>>>>>>> +++ b/drivers/md/md.c
>>>>>>> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
>>>>>>>  {
>>>>>>>  	struct mddev *mddev = disk->private_data;
>>>>>>>
>>>>>>> +	msleep(10);
>>>>>>>  	BUG_ON(!mddev);
>>>>>>>  	atomic_dec(&mddev->openers);
>>>>>>>  	mddev_put(mddev);
>>>>>>
>>>>>> I have a repro and I tested it on my setup. It is not working for me.
>>>>>> My setup could be more "advanced", to maximize the chance of reproduction:
>>>>>>
>>>>>> # cat /proc/mdstat
>>>>>> Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
>>>>>> md121 : active raid0 nvme2n1[1] nvme5n1[0]
>>>>>> 7126394880 blocks super external:/md127/0 128k chunks
>>>>>>
>>>>>> md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
>>>>>>       104857600 blocks super external:/md126/0 64K chunks 2 near-copies [4/4] [UUUU]
>>>>>>
>>>>>> md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
>>>>>>       2655765504 blocks super external:/md126/1 level 5, 32k chunk, algorithm 0 [4/4] [UUUU]
>>>>>>
>>>>>> md124 : active raid1 nvme0n1[1] nvme3n1[0]
>>>>>>       99614720 blocks super external:/md125/0 [2/2] [UU]
>>>>>>
>>>>>> md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
>>>>>>       10402 blocks super external:imsm
>>>>>>
>>>>>> md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S) nvme4n1[0](S)
>>>>>>       20043 blocks super external:imsm
>>>>>>
>>>>>> md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
>>>>>>       10402 blocks super external:imsm
>>>>>>
>>>>>> I have almost a 99% repro ratio; slowly moving forward...
>>>>>>
>>>>>> It is an endless loop because systemd-shutdown sends the "stop_array"
>>>>>> ioctl, which succeeds, but the array is not stopped. For that reason it
>>>>>> sets "changed = true".
>>>>>
>>>>> How does systemd-shutdown judge whether the array is stopped? By reading
>>>>> /proc/mdstat, by running "ls /dev/md*", or some other way?
>>>>
>>>> Hi Yu,
>>>>
>>>> It trusts the return result; I confirmed that 0 is returned.
>>>> The weirdest part is that we return 0 but the array is still there, and it
>>>> is stopped again in the next systemd loop. I don't understand why yet...
>>>>
>>>>>> systemd-shutdown sees the change and retries to check whether there is
>>>>>> something else that can be stopped now, and again, and again...
>>>>>>
>>>>>> I will check what is returned first; it could be 0 or it could be a
>>>>>> positive errno (nit?) because systemd only cares about "if (r < 0)".
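For reference, here is a minimal user-space sketch of the loop behaviour being
described above. This is only an illustration of the logic, not systemd's
actual source; the helper name try_stop_array, the open flags, and the
"changed" handling are assumptions.

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/major.h>
#include <linux/raid/md_u.h>	/* STOP_ARRAY ioctl number */

/*
 * Hypothetical helper mirroring the described pattern: issue STOP_ARRAY and
 * treat only a negative return as failure.  If the ioctl returns 0 while the
 * array is in fact still present, the caller sets "changed" and walks the
 * whole device list again, forever.
 */
int try_stop_array(const char *node, int *changed)
{
	int fd = open(node, O_RDONLY | O_EXCL);	/* assumed open flags */

	if (fd < 0)
		return -1;		/* cannot claim the device right now */

	int r = ioctl(fd, STOP_ARRAY, NULL);

	close(fd);
	if (r < 0)			/* only r < 0 counts as an error */
		return -1;

	*changed = 1;			/* "stopped something" -> retry the list */
	return 0;
}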
>>>>>
>>>>> I did notice that there are lots of log messages about md123 being stopped:
>>>>>
>>>>> [ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
>>>>> [ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
>>>>> [ 1371.846845] md: md123 stopped.
>>>>> [ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
>>>>> [ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
>>>>> [ 1371.862941] md: md123 stopped.
>>>>>
>>>>> And md_ioctl->do_md_stop doesn't have an error path after printing this
>>>>> log, hence 0 will be returned to the user.
>>>>>
>>>>> The normal case is that:
>>>>>
>>>>> open md123
>>>>> ioctl STOP_ARRAY -> all rdev should be removed from array
>>>>> close md123 -> mddev will finally be freed by:
>>>>> md_release
>>>>> mddev_put
>>>>> set_bit(MD_DELETED, &mddev->flags) -> user should not see this mddev
>>>>> queue_work(md_misc_wq, &mddev->del_work)
>>>>>
>>>>> mddev_delayed_delete
>>>>> kobject_put(&mddev->kobj)
>>>>>
>>>>> md_kobj_release
>>>>> del_gendisk
>>>>> md_free_disk
>>>>> mddev_free
>>>>>
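To make the reference-drop step in that chain concrete, here is a simplified
sketch of mddev_put(), condensed from the call chain above and from the
mddev_put() hunk quoted later in this mail. It is a sketch of the described
behaviour, not the verbatim md.c source.

/*
 * Simplified sketch: only the last reference drop hides the mddev and queues
 * the delayed deletion; every other mddev_put() returns early.
 */
void mddev_put(struct mddev *mddev)
{
	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
		return;				/* not the last reference */

	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
	    mddev->ctime == 0 && !mddev->hold_active) {
		/* Array is not configured at all, and not held active, so destroy it */
		set_bit(MD_DELETED, &mddev->flags);	/* userspace should no longer see it */
		INIT_WORK(&mddev->del_work, mddev_delayed_delete);
		queue_work(md_misc_wq, &mddev->del_work);
	}
	spin_unlock(&all_mddevs_lock);
}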
>>>> Ok, thanks. I understand that md_release is called when the file
>>>> descriptor is closed, right?
>>>>
>>>
>>> Yes, normally closing md123 should drop the last reference.
>>>>
>>>>> Now that you can reproduce this problem 99% of the time, can you dig
>>>>> deeper and find out what is wrong?
>>>>
>>>> Yes, working on it!
>>>>
>>>> My first idea was that mddev_get and mddev_put are missing on the
>>>> md_ioctl() path, but adding them doesn't help with the issue. My
>>>> motivation here was that md_attr_store and md_attr_show are using them.
>>>>
>>>> systemd regenerates the list of MD arrays on every loop and the array is
>>>> always there; systemd is able to open a file descriptor for it (maybe the
>>>> array is inactive?).
>>>
>>> md123 should not be opened again; ioctl(STOP_ARRAY) already set the
>>> 'MD_CLOSING' flag to prevent that. Are you sure that systemd-shutdown does
>>> open and close the array in each loop?
>>
>> I realized that I'm wrong here. 'MD_CLOSING' is cleared before the ioctl
>> returns, by commit 065e519e71b2 ("md: MD_CLOSING needs to be cleared after
>> called md_set_readonly or do_md_stop").
>>
>> I'm confused here: the commit message says 'MD_CLOSING' should not be set
>> for the STOP_ARRAY_RO case, but I don't understand why it's cleared for
>> STOP_ARRAY as well.
>>
Can you try the following patch?
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3afd57622a0b..31b9cec7e4c0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7668,7 +7668,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
 			err = -EBUSY;
 			goto out;
 		}
-		did_set_md_closing = true;
+		if (cmd == STOP_ARRAY_RO)
+			did_set_md_closing = true;
 		mutex_unlock(&mddev->open_mutex);
 		sync_blockdev(bdev);
 	}

I think preventing the array from being opened again after STOP_ARRAY might
fix this.
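In other words, roughly the following (a simplified sketch of the open-side
check, not the verbatim md_open() code; the exact error code and locking
details differ between kernel versions): as long as MD_CLOSING stays set after
STOP_ARRAY, a new open of the array node is rejected, so systemd-shutdown
cannot grab a fresh descriptor and keep looping on a device that is already
being torn down.

/* Simplified, assumed shape of the open path while MD_CLOSING is set. */
static int md_open_sketch(struct block_device *bdev, fmode_t mode)
{
	struct mddev *mddev = bdev->bd_disk->private_data;
	int err;

	err = mutex_lock_interruptible(&mddev->open_mutex);
	if (err)
		return err;

	if (test_bit(MD_CLOSING, &mddev->flags)) {
		/* array is being stopped: refuse new openers */
		mutex_unlock(&mddev->open_mutex);
		return -ENODEV;
	}

	atomic_inc(&mddev->openers);
	mutex_unlock(&mddev->open_mutex);
	return 0;
}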
Thanks,
Kuai
> Not related to your findings, but:
>
> I replaced "if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))"
> because that is the way to exit without queuing the work:
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0fe7ab6e8ab9..80bd7446be94 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -618,8 +618,7 @@ static void mddev_delayed_delete(struct work_struct *ws);
>
>  void mddev_put(struct mddev *mddev)
>  {
> -	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
> -		return;
> +	spin_lock(&all_mddevs_lock);
>  	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
>  	    mddev->ctime == 0 && !mddev->hold_active) {
>  		/* Array is not configured at all, and not held active,
> @@ -634,6 +633,7 @@ void mddev_put(struct mddev *mddev)
>  		INIT_WORK(&mddev->del_work, mddev_delayed_delete);
>  		queue_work(md_misc_wq, &mddev->del_work);
>  	}
> +	atomic_dec(&mddev->active);
>  	spin_unlock(&all_mddevs_lock);
>  }
>
> After that I got a kernel panic, but it seems that the workqueue is scheduled:
>
> [   51.535103] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [   51.539115] ------------[ cut here ]------------
> [   51.543867] #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> [   51.543875] PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> [   51.552207] refcount_t: underflow; use-after-free.
> [   51.556820] CPU: 19 PID: 368 Comm: kworker/19:1 Not tainted 6.5.0+ #57
> [   51.556825] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0027.P82.2204080829 04/08/2022
> [   51.561979] WARNING: CPU: 26 PID: 376 at lib/refcount.c:28 refcount_warn_saturate+0x99/0xe0
> [   51.567273] Workqueue: mddev_delayed_delete [md_mod]
> [   51.569822] Modules linked in:
> [   51.574351]  (events)
> [   51.574353] RIP: 0010:process_one_work+0x10f/0x3d0
> [   51.579155] configfs
>
> In my case, it seems that the IMSM container device is being stopped in a
> loop, and it is inactive from the start. I'm not totally sure about this, but
> it could lead us to the root cause. As far as I know, the original reporter
> uses IMSM arrays too.
>
> Thanks,
> Mariusz
>
>