linux-kernel - Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

Message-ID: <20190227175119.GB19455@localhost.localdomain>
Date:   Wed, 27 Feb 2019 10:51:20 -0700
From:   Keith Busch <kbusch@...nel.org>
To:     Alex_Gagniuc@...lteam.com
Cc:     torvalds@...ux-foundation.org, axboe@...com, sagi@...mberg.me,
        linux-kernel@...r.kernel.org, linux-nvme@...ts.infradead.org,
        keith.busch@...el.com, mr.nuke.me@...il.com, hch@....de,
        jonathan.derrick@...el.com
Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Wed, Feb 27, 2019 at 04:42:05PM +0000, Alex_Gagniuc@...lteam.com wrote:
> On 2/26/19 7:02 PM, Linus Torvalds wrote:
> > On Tue, Feb 26, 2019 at 2:37 PM <Alex_Gagniuc@...lteam.com> wrote:
> >>
> >> Then nobody gets the (error) message. You can go a bit further and try
> >> 'pcie_ports=native'. Again, nobody gets the memo. ):
> > 
> > So? The error was bogus to begin with. Why would we care?
> 
> Of course nobody cares about that. We care about actual errors that we 
> now know we won't be notified of. Imagine if we didn't get the memo that 
> a piece of data is corrupt, and imagine the reaction of RAS folk.
> 
> And I know the counter to that is a panic() is much more likely to cause 
> data corruption, and we're trading one piece of crap for an even 
> stinkier one. Whatever we end up doing, we have to do better than 
> silence errors and pretend nothing happened.
> 
> 
> > Yes, yes, PCI bridges have the ability to return errors in accesses to
> > non-existent devices. But that was always bogus, and is never useful.
> > The whole "you get an interrupt or NMI on a bad access" is simply a
> > horribly broken model. It's not useful.
> > 
> > We already have long depended on hotplug drivers noticing the "oh, I'm
> > getting all-ff returns, the device may be gone". It's usually trivial,
> > and works a whole lot better.
> 
> And that's been working great, hasn't it? I think you're thinking 
> strictly about hotplug. There are other situations where things are all 
> F'd, but the hardware isn't sending all F's. (example: ECRC errors)
> 
> 
> > It's not an error. Trying to force it to be an NMI or SCI or machine
> > check is bogus. It causes horrendous pain, because asynchronous
> > reporting doesn't work reliably anyway, and *synchronous* reporting is
> > impossible to sanely handle without crazy problems.
> > 
> > So the only sane model for hotplug devices is "IO still works, and
> > returns all ones". Maybe with an async one-time and *recoverable*
> > machine check or other reporting the access after the fact.
> 
> Exactly!!! A notification (not calling it an 'error') that something 
> unusual has happened is good. Treating these things like errors is so 
> obvious, even a caveman wouldn't do it.
> In a world with FFS, we don't always get to have that model. Oh, FFS!
> 
> 
> > Anything else is simply broken. It would be broken even if firmware
> > wasn't involved, but obviously firmware people tend to often make a
> > bad situation even worse.
> 
> Linus, be nice to firmware people. I've met a few, and I can vouch that 
> they're very kind and nice. They're also very scared, especially when OS 
> people want to ask them a few questions.
> 
> I think FFS should get out of the way when OS advertises it's capable of 
> handling XYZ. There are some good arguments why this hasn't happened, 
> but I won't get into details. I do think it's unlikely that machines 
> will be moving back to an OS-controlled model.
> 
> And Linus, keep in mind, when these machines were developed, OSes 
> couldn't handle recovery properly. None of this was ever an issue. It's 
> our fault that we've changed the OS after the machines are on the market.
> 
> Alex

I can't tell where you're going with this. It doesn't sound like you're
talking about hotplug anymore, at least.
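
For reference, the "all-ff returns" check mentioned in the quoted text
above can be sketched roughly as follows. This is a minimal userspace
simulation, not code from the patch or from any driver: struct fake_dev,
fake_readl() and dev_check_present() are hypothetical names, and a real
driver would read a register through an ioremap()ed BAR instead. It also
assumes all-ones is never a valid value for the register being read,
which has to be checked per device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device's register block. */
struct fake_dev {
	volatile uint32_t status;	/* simulated status register */
	bool gone;			/* set once the device looks removed */
};

/*
 * Simulated MMIO read. On real hardware, a read from a removed PCI
 * device typically completes with all bits set.
 */
static uint32_t fake_readl(const struct fake_dev *dev)
{
	return dev->status;
}

/*
 * Returns true if the device still appears present. An all-ones
 * read is treated as "device may be gone" rather than as an error,
 * so the caller can stop issuing I/O and tear down gracefully.
 */
static bool dev_check_present(struct fake_dev *dev)
{
	uint32_t val = fake_readl(dev);

	if (val == UINT32_MAX) {
		dev->gone = true;
		return false;
	}
	return true;
}
```

The point of the sketch is that the check is synchronous and local to
the driver: no interrupt, NMI or machine check is needed to notice the
removal, which is the model argued for in the quoted text.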