linux-kernel - Re: [PATCH] block: fix deadlock between bd_link_disk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6211cd80-6573-656d-e198-befe074030d8@huaweicloud.com>
Date: Sun, 18 Feb 2024 15:47:12 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Song Liu <song@...nel.org>, Li Nan <linan666@...weicloud.com>
Cc: axboe@...nel.dk, linux-raid@...r.kernel.org, linux-block@...r.kernel.org,
 linux-kernel@...r.kernel.org, yi.zhang@...wei.com, houtao1@...wei.com,
 yangerkun@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH] block: fix deadlock between bd_link_disk_holder and
 partition scan

Hi,

在 2024/02/17 3:03, Song Liu 写道:
> On Thu, Feb 8, 2024 at 4:49 PM Song Liu <song@...nel.org> wrote:
>>
>> On Thu, Feb 8, 2024 at 12:44 AM Li Nan <linan666@...weicloud.com> wrote:
>>>
>>>
>>>
>>> 在 2024/2/8 14:50, Song Liu 写道:
>>>> On Wed, Feb 7, 2024 at 1:32 AM <linan666@...weicloud.com> wrote:
>>>>>
>>>>> From: Li Nan <linan122@...wei.com>
>>>>>
>>>>> 'open_mutex' of gendisk is used to protect open/close block devices. But
>>>>> in bd_link_disk_holder(), it is used to protect the creation of symlink
>>>>> between holding disk and slave bdev, which introduces some issues.
>>>>>
>>>>> When bd_link_disk_holder() is called, the driver is usually in the process
>>>>> of initialization/modification and may suspend submitting io. At this
>>>>> time, any io hold 'open_mutex', such as scanning partitions, can cause
>>>>> deadlocks. For example, in raid:
>>>>>
>>>>> T1                              T2
>>>>> bdev_open_by_dev
>>>>>    lock open_mutex [1]
>>>>>    ...
>>>>>     efi_partition
>>>>>     ...
>>>>>      md_submit_bio
>>>>>                                   md_ioctl mddev_syspend
>>>>>                                     -> suspend all io
>>>>>                                    md_add_new_disk
>>>>>                                     bind_rdev_to_array
>>>>>                                      bd_link_disk_holder
>>>>>                                       try lock open_mutex [2]
>>>>>       md_handle_request
>>>>>        -> wait mddev_resume
>>>>>
>>>>> T1 scan partition, T2 add a new device to raid. T1 waits for T2 to resume
>>>>> mddev, but T2 waits for open_mutex held by T1. Deadlock occurs.
>>>>>
>>>>> Fix it by introducing a local mutex 'holder_mutex' to replace 'open_mutex'.
>>>>
>>>> Is this to fix [1]? Do we need some Fixes and/or Closes tags?
>>>>
>>>
>>> No. Just use another way to fix [2], and both [2] and this patch can fix
>>> the issue. I am not sure about the root cause of [1] yet.
>>>
>>> [2] https://patchwork.kernel.org/project/linux-raid/list/?series=812045
>>>
>>>> Could you please add steps to reproduce this issue?
>>>
>>> We need to modify the kernel, add sleep in md_submit_bio() and md_ioctl()
>>> as below, and then:
>>>     1. mdadm -CR /dev/md0 -l1 -n2 /dev/sd[bc]  #create a raid
>>>     2. echo 1 > /sys/module/md_mod/parameters/error_inject  #enable sleep
>>>     3. 'mdadm --add /dev/md0 /dev/sda'  #add a disk to raid
>>>     4. submit ioctl BLKRRPART to raid within 10s.
>>
>> The analysis makes sense. I also hit the issue a couple times without adding
>> extra delays. But I am not sure whether this is the best fix (I didn't find real
>> issues with it either).
> 
> To be extra safe and future proof, we can do something like the
> following to only
> suspend the array for ADD_NEW_DISK on not-running arrays.
> 
> This appear to solve the problem reported in
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=218459
> 
> Thanks,
> Song
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 9e41a9aaba8b..395911d5f4d6 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7570,10 +7570,11 @@ static inline bool md_ioctl_valid(unsigned int cmd)
>          }
>   }
> 
> -static bool md_ioctl_need_suspend(unsigned int cmd)
> +static bool md_ioctl_need_suspend(struct mddev *mddev, unsigned int cmd)
>   {
>          switch (cmd) {
>          case ADD_NEW_DISK:
> +               return mddev->pers != NULL;

Did you check already that this problem is not related that 'active_io'
is leaked for flush IO?

I don't understand the problem reported yet. If 'mddev->pers' is not set
yet, md_submit_bio() will return directly, and 'active_io' should not be
grabbed in the first place.

md_run() is the only place to convert 'mddev->pers' from NULL to a real
personality, and it's protected by 'reconfig_mutex', however,
md_ioctl_need_suspend() is called without 'reconfig_mutex', hence there
is a race condition:

md_ioctl_need_suspend		array_state_store
  // mddev->pers is NULL, return false
				 mddev_lock
				 do_md_run
				  mddev->pers = xxx
				 mddev_unlock

  // mddev_suspend is not called
  mddev_lock
  md_add_new_disk
   if (mddev->pers)
    md_import_device
    bind_rdev_to_array
    add_bound_rdev
     mddev->pers->hot_add_disk
     -> hot add disk without suspending

Thanks,
Kuai

>          case HOT_ADD_DISK:
>          case HOT_REMOVE_DISK:
>          case SET_BITMAP_FILE:
> @@ -7625,6 +7626,7 @@ static int md_ioctl(struct block_device *bdev,
> blk_mode_t mode,
>          void __user *argp = (void __user *)arg;
>          struct mddev *mddev = NULL;
>          bool did_set_md_closing = false;
> +       bool need_suspend;
> 
>          if (!md_ioctl_valid(cmd))
>                  return -ENOTTY;
> @@ -7716,8 +7718,11 @@ static int md_ioctl(struct block_device *bdev,
> blk_mode_t mode,
>          if (!md_is_rdwr(mddev))
>                  flush_work(&mddev->sync_work);
> 
> -       err = md_ioctl_need_suspend(cmd) ? mddev_suspend_and_lock(mddev) :
> -                                          mddev_lock(mddev);
> +       need_suspend = md_ioctl_need_suspend(mddev, cmd);
> +       if (need_suspend)
> +               err = mddev_suspend_and_lock(mddev);
> +       else
> +               err = mddev_lock(mddev);
>          if (err) {
>                  pr_debug("md: ioctl lock interrupted, reason %d, cmd %d\n",
>                           err, cmd);
> @@ -7846,8 +7851,10 @@ static int md_ioctl(struct block_device *bdev,
> blk_mode_t mode,
>              err != -EINVAL)
>                  mddev->hold_active = 0;
> 
> -       md_ioctl_need_suspend(cmd) ? mddev_unlock_and_resume(mddev) :
> -                                    mddev_unlock(mddev);
> +       if (need_suspend)
> +               mddev_unlock_and_resume(mddev);
> +       else
> +               mddev_unlock(mddev);
> 
>   out:
>          if(did_set_md_closing)
> .
>