Message-Id: <CBH8K74TF8IQ.2KUOIGFJ7K8XP@nagato>
Date:   Wed, 19 May 2021 07:54:19 -0500
From:   "Robert Straw" <drbawb@...alsyntax.com>
To:     "Christoph Hellwig" <hch@...radead.org>
Cc:     "Bjorn Helgaas" <helgaas@...nel.org>, <bhelgaas@...gle.com>,
        <linux-pci@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        "Alex Williamson" <alex.williamson@...hat.com>
Subject: Re: [PATCH] pci: add NVMe FLR quirk to the SM951 SSD

On Wed May 19, 2021 at 3:44 AM CDT, Christoph Hellwig wrote:
> On Sat, May 15, 2021 at 12:20:05PM -0500, Robert Straw wrote:
> While it doesn't matter here, NVMe 1.1 is very much out of date, being
> a more than 8 year old specification. The current version is 1.4b,
> with NVMe 2.0 about to be released.

I can't comment on 2.0, but yes, 1.4b has the same aside regarding undefined
behavior on the SHST field (on p. 50). The only reason I was looking at
1.1a is that it's specifically listed on the datasheet for the SM951
(the device under test).

> No, we don't. This is a bug particular to a specific implementation.
> In fact the whole existing NVMe shutdown before reset quirk is rather
> broken and dangerous, as it concurrently accesses the NVMe registers
> with the actual driver, which could be trivially triggered through the
> sysfs reset attribute.

I'm not exactly clear on how the nvme driver would be racing against
vfio-pci here: (a) vfio-pci is the driver bound in this scenario, and (b)
the vfio-pci driver triggers this quirk by issuing an FLR, which is done
with the device locked (e.g., vfio_pci.c:499).

In my testing *without this patch* vfio-pci is still bound to the device
for at least 60s after guest shutdown, at which point the FLR times out.
After this FLR the device is useless w/o a full reboot of the host.
Rebinding it, whether to another guest w/ vfio-pci or to the Linux nvme
driver, doesn't help: the device can no longer be reconfigured.

As I understand it, vfio-pci should not blindly issue an FLR to an NVMe class
device w/o obeying the protocol. The protocol seems clear that after a
shutdown CC.EN must transition from 1 to 0. (I would argue the guest OS
leaving the device in this state is the actual violation of the spec. As
I'm unable to change that behavior, having vfio-pci clean up the mess w/
this quirk seems to be an adequate workaround.)
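
To make the sequence above concrete, here is a minimal sketch of "clear
CC.EN, wait for CSTS.RDY to drop, then FLR". It is not the patch itself:
the register struct and the mock_hw_step() helper are hypothetical
stand-ins for the real MMIO accesses, and only the bit positions (CC.EN
and CSTS.RDY at bit 0) come from the spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions as defined by the NVMe controller register layout. */
#define NVME_CC_ENABLE  (1u << 0)   /* CC.EN */
#define NVME_CSTS_RDY   (1u << 0)   /* CSTS.RDY */

/* Hypothetical stand-in for the controller's memory-mapped registers. */
struct mock_nvme_regs {
	uint32_t cc;
	uint32_t csts;
};

/* Hypothetical hardware model: RDY drops once EN has been cleared. */
static void mock_hw_step(struct mock_nvme_regs *r)
{
	if (!(r->cc & NVME_CC_ENABLE))
		r->csts &= ~NVME_CSTS_RDY;
}

/*
 * Clear CC.EN, then poll until CSTS.RDY reads 0, giving up after
 * max_polls attempts. Returns true if the controller reached the
 * disabled state, at which point the caller could issue the FLR.
 */
static bool disable_before_flr(struct mock_nvme_regs *r, unsigned max_polls)
{
	assert(r != NULL);

	r->cc &= ~NVME_CC_ENABLE;

	for (unsigned i = 0; i < max_polls; i++) {
		mock_hw_step(r);
		if (!(r->csts & NVME_CSTS_RDY))
			return true;
	}
	return false;
}
```

In a real quirk the poll would be bounded by a timeout rather than an
iteration count, and the reads/writes would go through the mapped BAR.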

I am currently testing a version of this patch that only disables the
controller if the device has previously been shut down. I am trying to
gauge whether this would be preferable to just blanket-disabling these
bugged devices before relinquishing control of them back to the host.
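
One possible reading of "only if previously shut down", as a rough
sketch: look at CSTS.SHST (bits 3:2) and only take the disable path when
the controller is still enabled but reports something other than normal
operation. The function name is hypothetical, and the version I'm
testing may end up checking something different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NVME_CC_ENABLE         (1u << 0)   /* CC.EN */
#define NVME_CSTS_SHST_MASK    (3u << 2)   /* CSTS.SHST, bits 3:2 */
#define NVME_CSTS_SHST_NORMAL  (0u << 2)   /* no shutdown requested */
#define NVME_CSTS_SHST_OCCUR   (1u << 2)   /* shutdown in progress */
#define NVME_CSTS_SHST_CMPLT   (2u << 2)   /* shutdown complete */

static_assert(NVME_CSTS_SHST_MASK == 0xC, "SHST occupies bits 3:2");

/*
 * Hypothetical predicate: true when the controller is still enabled
 * (CC.EN = 1) but CSTS.SHST shows a shutdown in progress or complete,
 * i.e. the state the guest leaves the device in.
 */
static bool should_disable_before_flr(uint32_t cc, uint32_t csts)
{
	bool enabled = (cc & NVME_CC_ENABLE) != 0;
	bool shut_down = (csts & NVME_CSTS_SHST_MASK) != NVME_CSTS_SHST_NORMAL;

	return enabled && shut_down;
}
```

The blanket variant is simply this predicate replaced with "always true"
for the quirked device IDs.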

> I'd much rather quirk these broken Samsung drivers to not allow
> assigning them to VFIO.

I'd much rather keep using my storage devices. I will leave the 
quirk limited to these known-bugged devices.
