linux-kernel - RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can

Message-ID: <MWHPR21MB15930820BBB37CF66D991CB9D7929@MWHPR21MB1593.namprd21.prod.outlook.com>
Date:   Tue, 9 Mar 2021 15:45:15 +0000
From:   Michael Kelley <mikelley@...rosoft.com>
To:     melanieplageman <melanieplageman@...il.com>
CC:     "linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
        "andres@...razel.de" <andres@...razel.de>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "jejb@...ux.ibm.com" <jejb@...ux.ibm.com>,
        KY Srinivasan <kys@...rosoft.com>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "martin.petersen@...cle.com" <martin.petersen@...cle.com>,
        Stephen Hemminger <sthemmin@...rosoft.com>,
        "wei.liu@...nel.org" <wei.liu@...nel.org>
Subject: RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

From: Melanie Plageman <melanieplageman@...il.com> Sent: Monday, March 8, 2021 9:56 AM
> 
> On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote:
> > From: Melanie Plageman (Microsoft) <melanieplageman@...il.com> Sent: Friday, March 5, 2021 3:22 PM
> > >
> > > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
> > > allocation.
> > >
> > > Cap cmd_per_lun at can_queue to avoid dispatch errors.
> > >
> > > Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@...il.com>
> > > ---
> > >  drivers/scsi/storvsc_drv.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > > index 6bc5453cea8a..d7953a6e00e6 100644
> > > --- a/drivers/scsi/storvsc_drv.c
> > > +++ b/drivers/scsi/storvsc_drv.c
> > > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
> > >  				(max_sub_channels + 1) *
> > >  				(100 - ring_avail_percent_lowater) / 100;
> > >
> > > +	scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue);
> > > +
> >
> > I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?
> 
> scsi_driver.cmd_per_lun is set to 2048, which is then used to set
> Scsi_Host->cmd_per_lun in storvsc_probe().
> 
> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
> called and sets the scsi_device->queue_depth to the Scsi_Host's
> cmd_per_lun with this code:
> 
> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?
>                                         sdev->host->cmd_per_lun : 1);
> 
> During dispatch, the scsi_device->queue_depth is used in
> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
> whether or not the device can queue another command.
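A minimal sketch of the two steps just described, assuming a single-threaded model: the initial depth chosen in scsi_alloc_sdev() and the per-LUN budget check in scsi_dev_queue_ready(). The struct and function names (model_sdev, initial_queue_depth, dev_queue_ready) are hypothetical; the real kernel code uses an atomic busy counter and tracks additional blocked/starved state.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct scsi_device. */
struct model_sdev {
	unsigned int queue_depth; /* max commands allowed outstanding on this LUN */
	unsigned int device_busy; /* commands currently outstanding */
};

/* Mirrors the ternary quoted above: use cmd_per_lun, or 1 if it is zero. */
static unsigned int initial_queue_depth(unsigned int host_cmd_per_lun)
{
	return host_cmd_per_lun ? host_cmd_per_lun : 1;
}

/*
 * Simplified budget check: admit a command only while fewer than
 * queue_depth commands are outstanding. On success the command's
 * budget is taken by incrementing device_busy.
 */
static bool dev_queue_ready(struct model_sdev *sdev)
{
	if (sdev->device_busy >= sdev->queue_depth)
		return false;
	sdev->device_busy++;
	return true;
}
```

With cmd_per_lun at 2048, this check alone lets up to 2048 commands per LUN through, regardless of how much room the transport actually has.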
> 
> On some machines, with the initial scsi_device->queue_depth set from the
> 2048 cmd_per_lun value, commands can be queued that later cannot be
> dispatched because the ring buffer has run out of space.
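A toy model of that mismatch, assuming the ring buffer can hold a fixed number of outstanding requests (ring_slots, a hypothetical figure; the real limit depends on ring size, request size, and the number of channels). Every command that passes the queue_depth budget but finds the ring full is counted as a busy return, analogous to SCSI_MLQUEUE_DEVICE_BUSY.

```c
/*
 * Count how many of `inflight` concurrently submitted commands would be
 * bounced back as busy: the budget check admits up to queue_depth of
 * them, but the ring only has room for ring_slots. All names here are
 * hypothetical and the model ignores completions during submission.
 */
static unsigned int busy_returns(unsigned int inflight,
				 unsigned int queue_depth,
				 unsigned int ring_slots)
{
	unsigned int admitted = inflight < queue_depth ? inflight : queue_depth;

	/* Admitted commands beyond the ring's capacity cannot be written. */
	return admitted > ring_slots ? admitted - ring_slots : 0;
}
```

For example, with a hypothetical 500-slot ring, a depth of 2048 lets 1548 of 2048 submissions be admitted only to bounce, while a depth capped at 500 bounces none.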
> 
> On an 8-core Azure VM with 16GB of memory and a single 1 TiB SSD
> (running an fio workload that I can provide if needed), storvsc_do_io()
> frequently returns SCSI_MLQUEUE_DEVICE_BUSY.
> 
> This is the call stack:
> 
> hv_get_bytes_to_write
> hv_ringbuffer_write
> vmbus_send_packet
> storvsc_do_io
> storvsc_queuecommand
> scsi_dispatch_cmd
> scsi_queue_rq
> dispatch_rq_list
> 
> > Be aware that the calculation of "can_queue" in this driver is somewhat
> > flawed -- it should not be based on the size of the ring buffer, but instead on
> > the maximum number of requests Hyper-V will queue.  And even then,
> > can_queue doesn't provide the cap you might expect because the blk-mq layer
> > allocates can_queue tags for each HW queue, not as a total.
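To make the per-HW-queue point concrete: if blk-mq allocates can_queue tags for each of nr_hw_queues hardware queues, the number of requests that can be outstanding for the host is their product, not can_queue. A sketch, assuming no tag set is shared across hardware queues; the helper name and the example values are hypothetical.

```c
/*
 * Upper bound on requests blk-mq can have outstanding for one host when
 * each hardware queue gets its own set of can_queue tags.
 */
static unsigned long host_tag_capacity(unsigned int nr_hw_queues,
				       unsigned int can_queue)
{
	return (unsigned long)nr_hw_queues * can_queue;
}
```

With one hardware queue the bound equals can_queue; with, say, 8 hardware queues and can_queue = 256, up to 2048 requests could be outstanding.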
> 
> 
> The docs for scsi_mid_low_api document Scsi_Host can_queue this way:
> 
>   can_queue
>   - must be greater than 0; do not send more than can_queue
>     commands to the adapter.
> 
> I did notice that in scsi_host.h, the comment for can_queue does say
> can_queue is the "maximum number of simultaneous commands a single hw
> queue in HBA will accept." 

Yes, this comment is correct.  The can_queue value is per HW queue.

> However, I don't see it being used this way
> in the code.
> 
> During dispatch, in scsi_target_queue_ready(), there is this code:
> 
>         if (busy >= starget->can_queue)
>                 goto starved;
> 
> And the scsi_target->can_queue value should come from the Scsi_Host, as
> described in the scsi_target definition in scsi_device.h:
>     /*
>       * LLDs should set this in the slave_alloc host template callout.
>       * If set to zero then there is not limit.
>       */
>     unsigned int            can_queue;
> 
> So, I don't really see how this would be per hardware queue.

For the storvsc driver, the can_queue value in the scsi_target is initialized
to zero in scsi_alloc_target() and it remains unchanged.  Maybe I'm missing
something, but the only place I see that sets starget->can_queue to a
non-zero value is iscsi_target_alloc().  The storvsc slave_alloc() function does
not set it.  So the test in scsi_target_queue_ready() for exceeding can_queue
never executes.
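A simplified sketch of the target-level check as described here, assuming a zero can_queue short-circuits the comparison (which matches the "If set to zero then there is not limit" comment quoted above). The function name and parameters are hypothetical; the real code uses an atomic target_busy counter and also handles target_blocked.

```c
#include <stdbool.h>

/*
 * Returns true if the target can accept another command. A target
 * can_queue of zero means "no limit", so the busy comparison never
 * runs; that is the storvsc case, where slave_alloc() leaves
 * starget->can_queue at zero.
 */
static bool target_queue_ready(unsigned int target_busy,
			       unsigned int target_can_queue)
{
	if (target_can_queue == 0)
		return true; /* no target-level cap */
	return target_busy < target_can_queue;
}
```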

We've run live tests and can see that the number of requests sent to the
storvsc driver exceeds the can_queue value when the number of HW queues is
greater than 1.  That result is consistent with what I see in the code.

Michael

> 
> >
> > I agree that the cmd_per_lun setting is also too big, but we should fix that in
> > the context of getting all of these different settings working together correctly,
> > and not piecemeal.
> >
> 
> Capping Scsi_Host->cmd_per_lun at scsi_driver.can_queue during probe
> also prevents the LUN queue_depth from starting at a value higher than
> the user can ever set it to again once storvsc_change_queue_depth() is
> invoked.
> 
> Also, sdev_store_queue_depth() in scsi_sysfs.c has this check:
> 
>           if (depth < 1 || depth > sdev->host->can_queue)
>                 return -EINVAL;
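A sketch of why an initial depth above can_queue is a one-way state: the range check quoted above rejects any write larger than the host's can_queue, so once the depth is lowered, the original cmd_per_lun value of 2048 cannot be written back. The helper name is hypothetical, and the values in the note below are illustrative only.

```c
#include <errno.h>

/*
 * Mirrors the range check in sdev_store_queue_depth(): returns 0 if
 * the requested depth would be accepted, -EINVAL otherwise.
 */
static int validate_depth(int depth, int host_can_queue)
{
	if (depth < 1 || depth > host_can_queue)
		return -EINVAL;
	return 0;
}
```

If can_queue were 1024, for instance, writing 2048 back would fail with -EINVAL even though the device started at that depth.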
> 
> I would also note that in VirtIO SCSI's virtscsi_probe(), Scsi_Host->cmd_per_lun
> is set to the minimum of the configured cmd_per_lun and
> Scsi_Host->can_queue:
> 
>     shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
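In plain C, the min_t(u32, ...) cap used both by the proposed patch and by virtscsi_probe() amounts to the following. min_t is a kernel macro, so this user-space sketch substitutes a hypothetical helper over uint32_t.

```c
#include <stdint.h>

/* User-space stand-in for min_t(u32, cmd_per_lun, can_queue). */
static uint32_t cap_cmd_per_lun(uint32_t cmd_per_lun, uint32_t can_queue)
{
	/* Never let the initial per-LUN depth exceed what the host can queue. */
	return cmd_per_lun < can_queue ? cmd_per_lun : can_queue;
}
```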
> 
> Best,
> Melanie