[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <6531a3b0-915d-8078-b265-95231405ac4d@huawei.com>
Date: Mon, 25 Sep 2023 23:02:12 +0800
From: Wenchao Hao <haowenchao2@...wei.com>
To: "James E . J . Bottomley" <jejb@...ux.ibm.com>,
"Martin K . Petersen" <martin.petersen@...cle.com>,
<linux-scsi@...r.kernel.org>
CC: <linux-kernel@...r.kernel.org>, <louhongxiang@...wei.com>
Subject: Re: [PATCH 1/2] scsi: core: scsi_device_online() return false if
state is SDEV_CANCEL
On 2023/9/22 17:36, Wenchao Hao wrote:
> SDEV_CANCEL is set when removing device and scsi_device_online() should
> return false if sdev_state is SDEV_CANCEL.
>
> IO hang would be caused if return true when state is SDEV_CANCEL with
> following order:
>
> T1: T2:scsi_error_handler
> __scsi_remove_device()
> scsi_device_set_state(sdev, SDEV_CANCEL)
> scsi_eh_flush_done_q()
> if (scsi_device_online(sdev))
> scsi_queue_insert(scmd,...)
>
> The command added by scsi_queue_insert() would never be handled any
> more.
>
> Signed-off-by: Wenchao Hao <haowenchao2@...wei.com>
> ---
> include/scsi/scsi_device.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 75b2235b99e2..c498a12f7715 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -517,7 +517,8 @@ static inline int scsi_device_online(struct scsi_device *sdev)
> {
> return (sdev->sdev_state != SDEV_OFFLINE &&
> sdev->sdev_state != SDEV_TRANSPORT_OFFLINE &&
> - sdev->sdev_state != SDEV_DEL);
> + sdev->sdev_state != SDEV_DEL &&
> + sdev->sdev_state != SDEV_CANCEL);
> }
> static inline int scsi_device_blocked(struct scsi_device *sdev)
> {
Return false when if sdev_state is SDEV_CANCEL seems change some flow in
error handle, but I don't know if we should introduce these changes.
I think it's both ok to finish the failed command or try more recovery steps.
For example, in scsi_eh_bus_device_reset(), when scsi_try_bus_device_reset()
returned SUCCEED but the sdev_state is SDEV_CANCEL, should skip TUR and just
call scsi_eh_finish_cmd() to add this LUN's error command to done_q?
We can address the issue of IO hang described in this patch by running
scsi_device's queue regardless of the scsi_device's state and it seems
a better solution because the main reason of IO hang is as following:
scsi_restart_operations()
-> scsi_run_host_queues()
-> shost_for_each_device() // skip scsi_device with SDEV_DEL
// or SDEV_CANCEL state
Powered by blists - more mailing lists