linux-kernel - RE: [PATCH] scsi: storvsc: Fix a panic in the hibernation procedure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <HK0P153MB02737524F120829405C6DE68BFD30@HK0P153MB0273.APCP153.PROD.OUTLOOK.COM>
Date:   Thu, 23 Apr 2020 07:04:00 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     Bart Van Assche <bvanassche@....org>,
        "jejb@...ux.ibm.com" <jejb@...ux.ibm.com>,
        "martin.petersen@...cle.com" <martin.petersen@...cle.com>,
        "linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "hch@....de" <hch@....de>, "hare@...e.de" <hare@...e.de>,
        Michael Kelley <mikelley@...rosoft.com>,
        Long Li <longli@...rosoft.com>,
        "ming.lei@...hat.com" <ming.lei@...hat.com>,
        Balsundar P <Balsundar.P@...rochip.com>
CC:     "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        "wei.liu@...nel.org" <wei.liu@...nel.org>,
        Stephen Hemminger <sthemmin@...rosoft.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        KY Srinivasan <kys@...rosoft.com>
Subject: RE: [PATCH] scsi: storvsc: Fix a panic in the hibernation procedure

> From: Bart Van Assche <bvanassche@....org>
> Sent: Wednesday, April 22, 2020 12:56 PM
> On 4/21/20 11:24 PM, Dexuan Cui wrote:
> > Upon suspend, I suppose the other LLDs can not accept I/O either, then
> > what do they do when some kernel thread still tries to submit I/O? Do
> > they "block" the I/O until resume, or just return an error immediately?
> 
> This is my understanding of how other SCSI LLDs handle suspend/resume:
> - The ULP driver, e.g. the SCSI sd driver, implements power management
> support by providing callbacks in struct scsi_driver.gendrv.pm and also
> in scsi_bus_type.pm. The SCSI sd suspend callbacks flush the device
> cache and send a STOP command to the device.

It looks the sd suspend callbacks are only for the I/O from the disk, e.g.
from the file system that lives in some partition of some disk.

The panic I'm seeing is not from sd. I think it's from a kernel thread
that tries to detect the status of the SCSI CDROM. This is the snipped
messages (the full version is at https://lkml.org/lkml/2020/4/10/47): here
the suspend callbacks of the sd, sr and scsi_bus_type.pm have been called,
and later the storvsc LLD's suspend callback is also called, but
sr_block_check_events() can still try to submit SCSI commands to storvsc:

[   11.668741] sr 0:0:0:1: bus quiesce
[   11.668804] sd 0:0:0:0: bus quiesce
[   11.698082] scsi target0:0:0: bus quiesce
[   11.703296] scsi host0: bus quiesce
[   11.781730] hv_storvsc bf78936f-7d8f-45ce-ab03-6c341452e55d: noirq bus quiesce
[   11.796479] hv_netvsc dda5a2be-b8b8-4237-b330-be8a516a72c0: noirq bus quiesce
[   11.804042] BUG: kernel NULL pointer dereference, address: 0000000000000090
[   11.804996] Workqueue: events_freezable_power_ disk_events_workfn
[   11.804996] RIP: 0010:storvsc_queuecommand+0x261/0x714 [hv_storvsc]
[   11.804996] Call Trace:
[   11.804996]  scsi_queue_rq+0x593/0xa10
[   11.804996]  blk_mq_dispatch_rq_list+0x8d/0x510
[   11.804996]  blk_mq_sched_dispatch_requests+0xed/0x170
[   11.804996]  __blk_mq_run_hw_queue+0x55/0x110
[   11.804996]  __blk_mq_delay_run_hw_queue+0x141/0x160
[   11.804996]  blk_mq_sched_insert_request+0xc3/0x170
[   11.804996]  blk_execute_rq+0x4b/0xa0
[   11.804996]  __scsi_execute+0xeb/0x250
[   11.804996]  sr_check_events+0x9f/0x270 [sr_mod]
[   11.804996]  cdrom_check_events+0x1a/0x30 [cdrom]
[   11.804996]  sr_block_check_events+0xcc/0x110 [sr_mod]
[   11.804996]  disk_check_events+0x68/0x160
[   11.804996]  process_one_work+0x20c/0x3d0
[   11.804996]  worker_thread+0x2d/0x3e0
[   11.804996]  kthread+0x10c/0x130
[   11.804996]  ret_from_fork+0x35/0x40

It looks the issue is: scsi_bus_freeze() -> ... -> scsi_dev_type_suspend ->
scsi_device_quiesce() does not guarantee the device is totally quiescent:

/**
 *      scsi_device_quiesce - Block user issued commands.
 *      @sdev:  scsi device to quiesce.
 *
 *      This works by trying to transition to the SDEV_QUIESCE state
 *      (which must be a legal transition).  When the device is in this
 *      state, only special requests will be accepted, all others will
 *      be deferred.  Since special requests may also be requeued requests,
 *      a successful return doesn't guarantee the device will be
 *      totally quiescent.

I guess the "special requests" may include the "detect the status of the
SCSI CDROM" request from sr_block_check_events().

It looks scsi_device_quiesce() does not freeze the queue and it just puts the
device to the SDEV_QUIESCE state, which is not enough to prevent
sr_block_check_events from submitting SCSI commands.

It looks scsi_host_block() freezes the queue and flushes all the pending I/O,
so it can fix the aforementioned NULL pointer deference panic for me.

> - SCSI LLDs for PCIe devices optionally provide pci_driver.suspend and
> resume callbacks. These callbacks can be used to make the PCIe device
> enter/leave a power saving state. No new SCSI commands should be
> submitted after pci_driver.suspend has been called.
> 
> > I had a look at drivers/scsi/xen-scsifront.c. It looks this LLD implements
> > a mechanism of marking the device busy upon suspend, so any I/O will
> > immediately get an error SCSI_MLQUEUE_HOST_BUSY. IMO the
> > disadvantage is: the mechanism of marking the device busy looks
> > complex, and the hot path .queuecommand() has to take the
> > spinlock shost->host_lock, which should affect the performance.
> 
> I think the code you are referring to is the code in
> scsifront_suspend(). A pointer to that function is stored in a struct
> xenbus_driver instance. That's another structure than the structures
> mentioned above.
> 
> Wouldn't it be better to make sure that any hypervisor suspend
> operations happen after it is guaranteed that no further SCSI commands
> will be submitted such that hypervisor suspend operations do not have to
> deal with SCSI commands submitted during or after the hypervisor suspend
> callback?

I agree with you, but it looks scsi_bus_freeze() doesn't guarantee no
further SCSI commands will be submitted, as I described above. Maybe that
is why scsifront_suspend() has to invent a mechanism to cope with the issue.
 
> > It looks drivers/scsi/nsp32.c: nsp32_suspend() and
> > drivers/scsi/3w-9xxx.c: twa_suspend() do nothing to handle new I/O
> > after suspend. I doubt this is correct.
> 
> nsp32_suspend() is a PCI suspend callback. If any SCSI commands would be
> submitted after that callback has started that would mean that the SCSI
> suspend and PCIe suspend operations are called in the wrong order. I do
> not agree that code for suspending SCSI commands should be added in
> nsp32_suspend().
> 
> > So it looks to me there is no simple mechanism to handle the scenario
> > here, and I guess that's why the scsi_host_block/unblock APIs are
> > introduced, and actually there is already an user of the APIs:
> > 3d3ca53b1639 ("scsi: aacraid: use scsi_host_(block,unblock) to block I/O").
> >
> > The aacraid patch says: "This has the advantage that the block layer will
> > stop sending I/O to the adapter instead of having the SCSI midlayer
> > requeueing I/O internally". It looks this may imply that using the new
> > APIs is encouraged?
> 
> I'm fine with using these new functions in device reset handlers. Using
> these functions in power management handlers seems wrong to me.

It looks you're suggesting the scsi_host_block() in aac_suspend() should
be removed? :-)

> > PS, here storvsc has to destroy and re-construct the I/O queues: the
> > I/O queues are shared memory ringbufers between the guest and the
> > host; in the resume path of the hibernation procedure, the memory
> > pages allocated by the 'new' kernel is different from that allocated by
> > the 'old' kernel, so before jumping to the 'old' kernel, the 'new' kernel
> > must destroy the mapping of the pages, and later after we jump to
> > the 'old' kernel, we'll re-create the mapping using the pages allocated
> > by the 'old' kernel. Here "create the mapping" means the guest tells
> > the host about the physical addresses of the pages.
> 
> Thank you for having clarified this. This helps me to understand the HV
> driver framework better. I think this means that the hv_driver.suspend
> function should be called at a later time than SCSI suspend. From

I agree, but it looks here scsi_bus_freeze() doesn't work as we expected?

> Documentation/driver-api/device_link.rst: "By default, the driver core
> only enforces dependencies between devices that are borne out of a
> parent/child relationship within the device hierarchy: When suspending,
> resuming or shutting down the system, devices are ordered based on this
> relationship, i.e. children are always suspended before their parent,
> and the parent is always resumed before its children." Is there a single
> storvsc_drv instance for all SCSI devices supported by storvsc_drv? Has

Yes.

> it been considered to make storvsc_drv the parent device of all SCSI
> devices created by the storvsc driver?
> 
> Thanks,
> 
> Bart.

Yes, I think so:

root@...alhost:~# ls -rtl  /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver
lrwxrwxrwx 1 root root 0 Apr 22 01:10 /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver -> ../../../../../../../../../../bus/scsi/drivers/sd

Here the driver of /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943
is storvsc, which creates host3/target3:0:0/3:0:0:0.

So it looks there is no ordering issue.

Thanks,
-- Dexuan