linux-kernel - Re: [PATCH] scsi: Do not break scan luns loop if add single lun failed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aa72bd76-2af5-202d-8a2c-afb5a700b6c0@linux.ibm.com>
Date:   Thu, 30 Dec 2021 18:55:40 +0100
From:   Steffen Maier <maier@...ux.ibm.com>
To:     Wenchao Hao <haowenchao@...wei.com>,
        "James E . J . Bottomley" <jejb@...ux.ibm.com>,
        "Martin K . Petersen" <martin.petersen@...cle.com>,
        linux-scsi@...r.kernel.org, linux-kernel@...r.kernel.org,
        James Smart <james.smart@...adcom.com>,
        Dick Kennedy <dick.kennedy@...adcom.com>,
        Nilesh Javali <njavali@...vell.com>,
        GR-QLogic-Storage-Upstream@...vell.com,
        Hannes Reinecke <hare@...e.de>,
        Martin Wilck <martin.wilck@...e.com>,
        linux-s390 <linux-s390@...r.kernel.org>,
        Benjamin Block <bblock@...ux.ibm.com>,
        Ming Lei <ming.lei@...hat.com>
Cc:     Zhiqiang Liu <liuzhiqiang26@...wei.com>,
        Feilong Lin <linfeilong@...wei.com>, Wu Bo <wubo40@...wei.com>
Subject: Re: [PATCH] scsi: Do not break scan luns loop if add single lun
 failed

On 12/26/21 00:29, Wenchao Hao wrote:
> Failed to add a single lun does not mean all luns are unaccessible,
> if we break the scan luns loop, the other luns reported by REPORT LUNS
> command would not be probed any more.
> 
> In this case, we might loss some luns which are accessible.

Could you please add more details about the specific use case, where this 
actually was a problem, for my understanding?

> 
> Signed-off-by: Wenchao Hao <haowenchao@...wei.com>
> ---
>   drivers/scsi/scsi_scan.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 23e1c0acdeae..fee7ce082103 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -1476,13 +1476,13 @@ static int scsi_report_lun_scan(struct scsi_target *starget, blist_flags_t bflag
>   				lun, NULL, NULL, rescan, NULL);
>   			if (res == SCSI_SCAN_NO_RESPONSE) {
>   				/*
> -				 * Got some results, but now none, abort.
> +				 * Got some results, but now none, abort this lun

abort => skip ?

>   				 */
>   				sdev_printk(KERN_ERR, sdev,
>   					"Unexpected response"
>   					" from lun %llu while scanning, scan"
>   					" aborted\n", (unsigned long long)lun);

That message would no longer be correct with your change, as it would not abort 
the scan any more.

> -				break;
> +				continue;
>   			}
>   		}
>   	}


Wouldn't this change existing semantics for LLDDs intentionally returning 
-ENXIO from their slave_alloc() callback in certain cases?:


> static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
...
> 	if (shost->hostt->slave_alloc) {
> 		ret = shost->hostt->slave_alloc(sdev);
> 		if (ret) {
> 			/*
> 			 * if LLDD reports slave not present, don't clutter
> 			 * console with alloc failure messages
> 			 */
> 			if (ret == -ENXIO)
> 				display_failure_msg = 0;
> 			goto out_device_destroy;
...
> out_device_destroy:
> 	__scsi_remove_device(sdev);
> out:
> 	if (display_failure_msg)
> 		printk(ALLOC_FAILURE_MSG, __func__);
> 	return NULL;


scsi_probe_and_add_lun() [such as called by scsi_report_lun_scan() for the case 
at hand] converts this case into a SCSI_SCAN_NO_RESPONSE return value.

> static int scsi_probe_and_add_lun(struct scsi_target *starget,
...
> 	int res = SCSI_SCAN_NO_RESPONSE, result_len = 256;
...
> 		sdev = scsi_alloc_sdev(starget, lun, hostdata);
> 	if (!sdev)
> 		goto out;
...
>  out:
> 	return res;


Such as being used by zfcp:

> static int zfcp_scsi_slave_alloc(struct scsi_device *sdev)
> {
...
> 	unit = zfcp_unit_find(port, zfcp_scsi_dev_lun(sdev));
> 	if (unit)
> 		put_device(&unit->dev);
> 
> 	if (!unit && !(allow_lun_scan && npiv)) {
> 		put_device(&port->dev);
> 		return -ENXIO;
                        ^^^^^^

which implements an initiator-based LUN masking that is necessary for shared 
HBAs virtualized without NPIV.
https://www.ibm.com/docs/en/linux-on-systems?topic=devices-manually-configured-fcp-luns

While things might still work, as zfcp now "just" gets (much) more callbacks to 
slave_alloc() it has to end with -ENXIO, the user may get flooded with the 
error(!) sdev_printk on "Unexpected response from LUN ..." in 
scsi_report_lun_scan().
In the worst case, we could get this message now 64k - 1 times in a zfcp 
scenario connected to IBM DS8000 storage being able to map (all) 64k volumes to 
a single initiator (HBA), where the user via zfcp sysfs decided to use only the 
first lun reported (for the vHBA).

Other LLLDs also seem to intentionally return -ENXIO from slave_alloc() 
callbacks, such as but not limited to lpfc or qla2xxx:

> int fc_slave_alloc(struct scsi_device *sdev)
> {
> 	struct fc_rport *rport = starget_to_rport(scsi_target(sdev));
> 
> 	if (!rport || fc_remote_port_chkready(rport))
> 		return -ENXIO;

> static int
> qla2xxx_slave_alloc(struct scsi_device *sdev)
> {
> 	struct fc_rport *rport = starget_to_rport(scsi_target(sdev));
> 
> 	if (!rport || fc_remote_port_chkready(rport))
> 		return -ENXIO;


-- 
Mit freundlichen Gruessen / Kind regards
Steffen Maier

Linux on IBM Z and LinuxONE

https://www.ibm.com/privacy/us/en/
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Gregor Pillen
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294