Message-ID: <51411297-f579-4229-a72c-c5bd5f27df34@vivo.com>
Date: Wed, 17 Jul 2024 18:15:08 +0800
From: YangYang <yang.yang@...o.com>
To: Yu Kuai <yukuai1@...weicloud.com>
Cc: Jens Axboe <axboe@...nel.dk>, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH] block: fix deadlock between sd_remove & sd_release
On 2024/7/17 11:53, Yu Kuai wrote:
> Hi,
>
> 在 2024/07/16 17:30, YangYang 写道:
>> On 2024/7/16 17:15, Yu Kuai wrote:
>>> Hi,
>>>
>>> 在 2024/07/16 16:38, Yang Yang 写道:
>>>> Our test reported the following hung tasks:
>>>>
>>>> [ 2538.459400] INFO: task "kworker/0:0":7 blocked for more than 188 seconds.
>>>> [ 2538.459427] Call trace:
>>>> [ 2538.459430] __switch_to+0x174/0x338
>>>> [ 2538.459436] __schedule+0x628/0x9c4
>>>> [ 2538.459442] schedule+0x7c/0xe8
>>>> [ 2538.459447] schedule_preempt_disabled+0x24/0x40
>>>> [ 2538.459453] __mutex_lock+0x3ec/0xf04
>>>> [ 2538.459456] __mutex_lock_slowpath+0x14/0x24
>>>> [ 2538.459459] mutex_lock+0x30/0xd8
>>>> [ 2538.459462] del_gendisk+0xdc/0x350
>>>> [ 2538.459466] sd_remove+0x30/0x60
>>>> [ 2538.459470] device_release_driver_internal+0x1c4/0x2c4
>>>> [ 2538.459474] device_release_driver+0x18/0x28
>>>> [ 2538.459478] bus_remove_device+0x15c/0x174
>>>> [ 2538.459483] device_del+0x1d0/0x358
>>>> [ 2538.459488] __scsi_remove_device+0xa8/0x198
>>>> [ 2538.459493] scsi_forget_host+0x50/0x70
>>>> [ 2538.459497] scsi_remove_host+0x80/0x180
>>>> [ 2538.459502] usb_stor_disconnect+0x68/0xf4
>>>> [ 2538.459506] usb_unbind_interface+0xd4/0x280
>>>> [ 2538.459510] device_release_driver_internal+0x1c4/0x2c4
>>>> [ 2538.459514] device_release_driver+0x18/0x28
>>>> [ 2538.459518] bus_remove_device+0x15c/0x174
>>>> [ 2538.459523] device_del+0x1d0/0x358
>>>> [ 2538.459528] usb_disable_device+0x84/0x194
>>>> [ 2538.459532] usb_disconnect+0xec/0x300
>>>> [ 2538.459537] hub_event+0xb80/0x1870
>>>> [ 2538.459541] process_scheduled_works+0x248/0x4dc
>>>> [ 2538.459545] worker_thread+0x244/0x334
>>>> [ 2538.459549] kthread+0x114/0x1bc
>>>>
>>>> [ 2538.461001] INFO: task "fsck.":15415 blocked for more than 188 seconds.
>>>> [ 2538.461014] Call trace:
>>>> [ 2538.461016] __switch_to+0x174/0x338
>>>> [ 2538.461021] __schedule+0x628/0x9c4
>>>> [ 2538.461025] schedule+0x7c/0xe8
>>>> [ 2538.461030] blk_queue_enter+0xc4/0x160
>>>> [ 2538.461034] blk_mq_alloc_request+0x120/0x1d4
>>>> [ 2538.461037] scsi_execute_cmd+0x7c/0x23c
>>>> [ 2538.461040] ioctl_internal_command+0x5c/0x164
>>>> [ 2538.461046] scsi_set_medium_removal+0x5c/0xb0
>>>> [ 2538.461051] sd_release+0x50/0x94
>>>> [ 2538.461054] blkdev_put+0x190/0x28c
>>>> [ 2538.461058] blkdev_release+0x28/0x40
>>>> [ 2538.461063] __fput+0xf8/0x2a8
>>>> [ 2538.461066] __fput_sync+0x28/0x5c
>>>> [ 2538.461070] __arm64_sys_close+0x84/0xe8
>>>> [ 2538.461073] invoke_syscall+0x58/0x114
>>>> [ 2538.461078] el0_svc_common+0xac/0xe0
>>>> [ 2538.461082] do_el0_svc+0x1c/0x28
>>>> [ 2538.461087] el0_svc+0x38/0x68
>>>> [ 2538.461090] el0t_64_sync_handler+0x68/0xbc
>>>> [ 2538.461093] el0t_64_sync+0x1a8/0x1ac
>>>>
>>>> T1:                                T2:
>>>> sd_remove
>>>> del_gendisk
>>>> __blk_mark_disk_dead
>>>> blk_freeze_queue_start
>>>> ++q->mq_freeze_depth
>>>>                                    bdev_release
>>>>                                    mutex_lock(&disk->open_mutex)
>>>>                                    sd_release
>>>>                                    scsi_execute_cmd
>>>>                                    blk_queue_enter
>>>>                                    wait_event(!q->mq_freeze_depth)
>>>
>>> This looks wrong; there is a blk_queue_dying() check in
>>> blk_queue_enter().
>>
>> 584 static void __blk_mark_disk_dead(struct gendisk *disk)
>> 585 {
>> ......
>> 591
>> 592 if (test_bit(GD_OWNS_QUEUE, &disk->state))
>> 593 blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
>>
>> SCSI does not set GD_OWNS_QUEUE, so QUEUE_FLAG_DYING is not set in
>> this scenario.
>
> Yes, you're right. Please explain this in commit message as well.
Got it.
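I will also quote the wait condition in blk_queue_enter() there. Roughly
(trimmed from block/blk-core.c, with error paths and memory barriers
omitted, so please treat it as a sketch rather than the exact code):

int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
	const bool pm = flags & BLK_MQ_REQ_PM;

	while (!blk_try_enter_queue(q, pm)) {
		if (flags & BLK_MQ_REQ_NOWAIT)
			return -EAGAIN;
		...
		/* Sleeps until the freeze is lifted or the queue is dying. */
		wait_event(q->mq_freeze_wq,
			   (!q->mq_freeze_depth &&
			    blk_pm_resume_queue(pm, q)) ||
			   blk_queue_dying(q));
		if (blk_queue_dying(q))
			return -ENODEV;
	}
	return 0;
}

Since SCSI does not set GD_OWNS_QUEUE, QUEUE_FLAG_DYING is never set in
this scenario, so the blk_queue_dying() escape never fires and T2 sleeps
on mq_freeze_wq forever.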
>>
>> Thanks.
>>
>>>
>>> Thanks,
>>> Kuai
>>>
>>>> mutex_lock(&disk->open_mutex)
>>>>
>>>> This is a classic ABBA deadlock. To fix the deadlock, make sure we don't
>>>> try to acquire disk->open_mutex after freezing the queue.
>>>>
>>>> Signed-off-by: Yang Yang <yang.yang@...o.com>
>>>> ---
>>>> block/genhd.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/genhd.c b/block/genhd.c
>>>> index 8f1f3c6b4d67..c5fca3e893a0 100644
>>>> --- a/block/genhd.c
>>>> +++ b/block/genhd.c
>>>> @@ -663,12 +663,12 @@ void del_gendisk(struct gendisk *disk)
>>>> */
>>>> if (!test_bit(GD_DEAD, &disk->state))
>>>> blk_report_disk_dead(disk, false);
>>>> - __blk_mark_disk_dead(disk);
>>>> /*
>>>> * Drop all partitions now that the disk is marked dead.
>>>> */
>>>> mutex_lock(&disk->open_mutex);
>>>> + __blk_mark_disk_dead(disk);
>>>> xa_for_each_start(&disk->part_tbl, idx, part, 1)
>>>> drop_partition(part);
>>>> mutex_unlock(&disk->open_mutex);
>
> Take a look at del_gendisk(): between freezing and unfreezing the
> queue, device_del() is also called, which means it will wait for all
> sysfs readers/writers to be done, so the related sysfs APIs must not
> issue IO that triggers blk_queue_enter(). This behaviour doesn't look
> reasonable; we never forbid this.
>
> Then take a look at the scsi sysfs APIs: I notice that
> scsi_rescan_device() can be called there and issues IO. Hence, apart
> from 'open_mutex', the same deadlock still exists:
>
> t1:                                   t2:
> store_state_field
> scsi_rescan_device
> scsi_attach_vpd
> scsi_get_vpd_buf
> scsi_vpd_inquiry
> scsi_execute_cmd
>                                       del_gendisk
>                                       bdev_unhash
>                                       blk_freeze_queue_start
> blk_queue_enter
>                                       device_del
>                                       kobject_del
>                                       sysfs_remove_dir
>
> I didn't test this, it's just from code review, and I could be wrong.
> I haven't checked all the procedures between freeze and unfreeze yet.
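For reference, here is my rough (and heavily trimmed) reading of the
store_state_field() path you mention in drivers/scsi/scsi_sysfs.c; the
state checks, parsing, and error handling are omitted, so please treat
it as an illustration rather than the exact code:

static ssize_t store_state_field(struct device *dev,
				 struct device_attribute *attr,
				 const char *buf, size_t count)
{
	struct scsi_device *sdev = to_scsi_device(dev);
	bool rescan_dev = false;
	...
	mutex_lock(&sdev->state_mutex);
	ret = scsi_device_set_state(sdev, state);
	if (ret == 0 && state == SDEV_RUNNING)
		rescan_dev = true;
	mutex_unlock(&sdev->state_mutex);

	if (rescan_dev)
		/* The rescan may issue SCSI commands, i.e. hit blk_queue_enter(). */
		scsi_rescan_device(sdev);
	...
}

So the IO in your diagram comes from a store on the scsi_device's own
sysfs directory.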
These sysfs nodes are in different directories: the scsi node is
located at /sys/bus/scsi/devices/0:0:0:0 and the gendisk node at
/sys/block/sda. Would it be necessary to wait for the completion of
read/write operations on the scsi sysfs nodes before removing the
gendisk sysfs node?
Thanks.