Message-ID: <3aa3ea7e-334c-6382-88f9-34eaa6b355fe@huawei.com>
Date: Tue, 4 Jan 2022 20:10:19 +0800
From: Wenchao Hao <haowenchao@...wei.com>
To: Steffen Maier <maier@...ux.ibm.com>,
"James E . J . Bottomley" <jejb@...ux.ibm.com>,
"Martin K . Petersen" <martin.petersen@...cle.com>,
<linux-scsi@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
James Smart <james.smart@...adcom.com>,
Dick Kennedy <dick.kennedy@...adcom.com>,
"Nilesh Javali" <njavali@...vell.com>,
<GR-QLogic-Storage-Upstream@...vell.com>,
Hannes Reinecke <hare@...e.de>,
Martin Wilck <martin.wilck@...e.com>,
linux-s390 <linux-s390@...r.kernel.org>,
Benjamin Block <bblock@...ux.ibm.com>,
Ming Lei <ming.lei@...hat.com>
CC: Zhiqiang Liu <liuzhiqiang26@...wei.com>,
Feilong Lin <linfeilong@...wei.com>, Wu Bo <wubo40@...wei.com>
Subject: Re: [PATCH] scsi: Do not break the scan loop if adding a single LUN fails
On 2021/12/31 1:55, Steffen Maier wrote:
> On 12/26/21 00:29, Wenchao Hao wrote:
>> Failing to add a single LUN does not mean all LUNs are inaccessible.
>> If we break out of the LUN scan loop, the other LUNs reported by the
>> REPORT LUNS command will not be probed any more.
>>
>> In this case, we might lose some LUNs which are accessible.
>
> Could you please add more details about the specific use case, where
> this actually was a problem, for my understanding?
>
Suppose REPORT LUNS returns 4 LUNs: lun0, lun1, lun2 and lun3. If lun1
becomes inaccessible during the scan, scsi_probe_and_add_lun() fails
for lun1 even though lun2 and lun3 are still accessible.
scsi_report_lun_scan() then prints an error log, breaks out of the
loop and returns 0, so scsi_sequential_lun_scan() is not called either.
In this scenario, lun2 and lun3 are never probed or added, so we lose
them.
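
To make the flow concrete, here is a simplified sketch of the loop in
scsi_report_lun_scan() with this patch applied (paraphrased from the
hunk below; the max_lun check and surrounding details are omitted):

	for (lunp = &lun_data[1]; lunp <= &lun_data[num_luns]; lunp++) {
		lun = scsilun_to_int(lunp);
		res = scsi_probe_and_add_lun(starget, lun, NULL, NULL,
					     rescan, NULL);
		if (res == SCSI_SCAN_NO_RESPONSE)
			continue;	/* was: break, which ended the
					 * whole loop when lun1 failed */
		/* with "continue", lun2 and lun3 are still reached */
	}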
>>
>> Signed-off-by: Wenchao Hao <haowenchao@...wei.com>
>> ---
>> drivers/scsi/scsi_scan.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>> index 23e1c0acdeae..fee7ce082103 100644
>> --- a/drivers/scsi/scsi_scan.c
>> +++ b/drivers/scsi/scsi_scan.c
>> @@ -1476,13 +1476,13 @@ static int scsi_report_lun_scan(struct scsi_target *starget, blist_flags_t bflag
>>  				lun, NULL, NULL, rescan, NULL);
>>  		if (res == SCSI_SCAN_NO_RESPONSE) {
>>  			/*
>> -			 * Got some results, but now none, abort.
>> +			 * Got some results, but now none, abort this lun
>
> abort => skip ?
Yes, "skip" looks better than "abort".
>
>>  			 */
>>  			sdev_printk(KERN_ERR, sdev,
>>  				    "Unexpected response"
>>  				    " from lun %llu while scanning, scan"
>>  				    " aborted\n", (unsigned long long)lun);
>
> That message would no longer be correct with your change, as it would
> not abort the scan any more.
I will change "abort" to "skip", which makes the message accurate; see
the sketch below the quoted hunk.
>
>> -			break;
>> +			continue;
>>  		}
>>  	}
>>  }
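
With the wording fixed as discussed above, the hunk in v2 would read
roughly like this (a sketch, untested):

			/*
			 * Got some results, but now none, skip this lun
			 */
			sdev_printk(KERN_ERR, sdev,
				    "Unexpected response"
				    " from lun %llu while scanning, lun"
				    " skipped\n", (unsigned long long)lun);
			continue;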
>
>
> Wouldn't this change existing semantics for LLDDs intentionally
> returning -ENXIO from their slave_alloc() callback in certain cases?:
>
>
Yes, it would print an error message like "Unexpected response ..." for
every failed LUN. I think that is reasonable, since it lets us see every
LUN that failed during one scan.
>> static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
> ...
>> 	if (shost->hostt->slave_alloc) {
>> 		ret = shost->hostt->slave_alloc(sdev);
>> 		if (ret) {
>> 			/*
>> 			 * if LLDD reports slave not present, don't clutter
>> 			 * console with alloc failure messages
>> 			 */
>> 			if (ret == -ENXIO)
>> 				display_failure_msg = 0;
>> 			goto out_device_destroy;
> ...
>> out_device_destroy:
>> 	__scsi_remove_device(sdev);
>> out:
>> 	if (display_failure_msg)
>> 		printk(ALLOC_FAILURE_MSG, __func__);
>> 	return NULL;
>
>
> scsi_probe_and_add_lun() [such as called by scsi_report_lun_scan() for
> the case at hand] converts this case into a SCSI_SCAN_NO_RESPONSE return
> value.
>
>> static int scsi_probe_and_add_lun(struct scsi_target *starget,
> ...
>> 	int res = SCSI_SCAN_NO_RESPONSE, result_len = 256;
> ...
>> 	sdev = scsi_alloc_sdev(starget, lun, hostdata);
>> 	if (!sdev)
>> 		goto out;
> ...
>> out:
>> 	return res;
>
>
> Such as being used by zfcp:
>
>> static int zfcp_scsi_slave_alloc(struct scsi_device *sdev)
>> {
> ...
>> 	unit = zfcp_unit_find(port, zfcp_scsi_dev_lun(sdev));
>> 	if (unit)
>> 		put_device(&unit->dev);
>>
>> 	if (!unit && !(allow_lun_scan && npiv)) {
>> 		put_device(&port->dev);
>> 		return -ENXIO;
> ^^^^^^
>
> which implements an initiator-based LUN masking that is necessary for
> shared HBAs virtualized without NPIV.
> https://www.ibm.com/docs/en/linux-on-systems?topic=devices-manually-configured-fcp-luns
>
>
> While things might still work, as zfcp now "just" gets (many) more
> slave_alloc() callbacks that it has to answer with -ENXIO, the user
> may get flooded with the error(!) sdev_printk "Unexpected response
> from LUN ..." in scsi_report_lun_scan().
> In the worst case, we could now get this message 64k - 1 times in a
> zfcp scenario connected to IBM DS8000 storage, which is able to map
> (all) 64k volumes to a single initiator (HBA), where the user via zfcp
> sysfs decided to use only the first LUN reported (for the vHBA).
>
Printing the error log 64k - 1 times does sound terrible, though I do
not understand what "where the user via zfcp sysfs decided to use only
the first LUN reported (for the vHBA)" means.
Why would slave_alloc() fail for all LUNs? That does not seem like a
normal scenario. (My understanding of how the -ENXIO propagates is
sketched at the end of this mail.)
> Other LLDDs also seem to intentionally return -ENXIO from slave_alloc()
> callbacks, such as but not limited to lpfc or qla2xxx:
>
>> int fc_slave_alloc(struct scsi_device *sdev)
>> {
>> 	struct fc_rport *rport = starget_to_rport(scsi_target(sdev));
>>
>> 	if (!rport || fc_remote_port_chkready(rport))
>> 		return -ENXIO;
>
>> static int
>> qla2xxx_slave_alloc(struct scsi_device *sdev)
>> {
>> 	struct fc_rport *rport = starget_to_rport(scsi_target(sdev));
>>
>> 	if (!rport || fc_remote_port_chkready(rport))
>> 		return -ENXIO;
>
>
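
To summarize my understanding of how such a failure propagates, pieced
together from the snippets quoted above (a simplified sketch;
example_slave_alloc() is a hypothetical stand-in for the zfcp, lpfc and
qla2xxx callbacks, not real kernel code):

	/* LLDD callback rejecting a masked or unreachable LUN */
	static int example_slave_alloc(struct scsi_device *sdev)
	{
		return -ENXIO;	/* "slave not present" */
	}

	/*
	 * Propagation per the quoted snippets:
	 *   slave_alloc() returns -ENXIO
	 *     -> scsi_alloc_sdev() suppresses ALLOC_FAILURE_MSG but
	 *        still destroys the device and returns NULL
	 *     -> scsi_probe_and_add_lun() turns the NULL sdev into
	 *        SCSI_SCAN_NO_RESPONSE
	 *     -> scsi_report_lun_scan(): with this patch, the
	 *        "Unexpected response" error prints once per such LUN
	 *        instead of aborting the loop at the first one.
	 */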