linux-kernel - Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline


Message-ID: <20190228141655.GA18319@infradead.org>
Date:   Thu, 28 Feb 2019 06:16:55 -0800
From:   Christoph Hellwig <hch@...radead.org>
To:     Austin.Bolen@...l.com
Cc:     Alex_Gagniuc@...lteam.com, torvalds@...ux-foundation.org,
        keith.busch@...el.com, sagi@...mberg.me,
        linux-kernel@...r.kernel.org, linux-nvme@...ts.infradead.org,
        axboe@...com, mr.nuke.me@...il.com, hch@....de,
        jonathan.derrick@...el.com
Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Wed, Feb 27, 2019 at 08:04:35PM +0000, Austin.Bolen@...l.com wrote:
> Confirmed this issue does not apply to the referenced Dell servers so I 
> do not have a stake in how this should be handled for those systems.
> It may be they just don't support surprise removal.  I know in our case 
> all the Linux distributions we qualify (RHEL, SLES, Ubuntu Server) have 
> told us they do not support surprise removal.  So I'm guessing that any 
> issues found with surprise removal could potentially fall under the 
> category of "unsupported".
> 
> Still though, the larger issue of recovering from other types of PCIe 
> errors that are not due to device removal is still important.  I would 
> expect many systems from many platform makers to not be able to recover
> from PCIe errors in general and hopefully the new DPC CER model will help
> address this and provide added protection for cases like above as well.

FYI, a related issue I saw about a year or two ago on Dell servers
involved a dual-ported NVMe add-in (non-U.2) card: once you did a
subsystem reset, which would cause both controllers to retrain the link,
you'd run into a Firmware First error handling issue that would instantly
crash the system.  I don't really have the hardware anymore, but I think
the end result was that the affected product ended up shipping with
subsystem resets only enabled for the U.2 form factor.