linux-kernel - Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

Message-ID: <940d608e1a044a54abcb9d65923951f3@ausx13mps317.AMER.DELL.COM>
Date:   Wed, 27 Feb 2019 16:42:05 +0000
From:   <Alex_Gagniuc@...lteam.com>
To:     <torvalds@...ux-foundation.org>
CC:     <keith.busch@...el.com>, <axboe@...com>, <sagi@...mberg.me>,
        <linux-kernel@...r.kernel.org>, <linux-nvme@...ts.infradead.org>,
        <mr.nuke.me@...il.com>, <hch@....de>, <jonathan.derrick@...el.com>
Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On 2/26/19 7:02 PM, Linus Torvalds wrote:
> On Tue, Feb 26, 2019 at 2:37 PM <Alex_Gagniuc@...lteam.com> wrote:
>>
>> Then nobody gets the (error) message. You can go a bit further and try
>> 'pcie_ports=native'. Again, nobody gets the memo. ):
> 
> So? The error was bogus to begin with. Why would we care?

Of course nobody cares about that. We care about actual errors that we 
now know we won't be notified of. Imagine if we didn't get the memo that 
a piece of data is corrupt, and imagine the reaction of RAS folk.

And I know the counter to that is a panic() is much more likely to cause 
data corruption, and we're trading one piece of crap for an even 
stinkier one. Whatever we end up doing, we have to do better than 
silencing errors and pretending nothing happened.

> Yes, yes, PCI bridges have the ability to return errors in accesses to
> non-existent devices. But that was always bogus, and is never useful.
> The whole "you get an interrupt or NMI on a bad access" is simply a
> horribly broken model. It's not useful.
> 
> We already have long depended on hotplug drivers noticing the "oh, I'm
> getting all-ff returns, the device may be gone". It's usually trivial,
> and works a whole lot better.

And that's been working great, hasn't it? I think you're thinking 
strictly about hotplug. There are other situations where things are all 
F'd, but the hardware isn't sending all F's. (example: ECRC errors)
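For readers following along, the "all-ff" check under discussion can be sketched roughly as below. This is a userspace illustration only, not the nvme-pci code: struct fake_dev, fake_readl(), dev_maybe_gone() and read_status() are hypothetical names made up for the example, and a real driver would read through its mapped BAR instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device's register window (assumption). */
struct fake_dev {
    bool present;     /* false once the device has been removed */
    uint32_t status;  /* value a healthy device would return */
};

/* Models an MMIO read: a read from a missing device completes as all ones. */
static uint32_t fake_readl(const struct fake_dev *dev)
{
    return dev->present ? dev->status : UINT32_MAX;
}

/* True if the value read back looks like "the device may be gone". */
static bool dev_maybe_gone(uint32_t val)
{
    return val == UINT32_MAX;
}

/* Returns 0 and stores the status, or -1 if the device appears absent. */
static int read_status(const struct fake_dev *dev, uint32_t *out)
{
    uint32_t val = fake_readl(dev);

    if (dev_maybe_gone(val))
        return -1;
    *out = val;
    return 0;
}
```

A check like this only catches the case where reads actually come back as all ones. A failure that does not produce all F's (the ECRC example above) would pass straight through it.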

> It's not an error. Trying to force it to be an NMI or SCI or machine
> check is bogus. It causes horrendous pain, because asynchronous
> reporting doesn't work reliably anyway, and *synchronous* reporting is
> impossible to sanely handle without crazy problems.
> 
> So the only sane model for hotplug devices is "IO still works, and
> returns all ones". Maybe with an async one-time and *recoverable*
> machine check or other reporting the access after the fact.

Exactly!!! A notification (not calling it an 'error') that something 
unusual has happened is good. Treating these things like errors is so 
obvious, even a caveman wouldn't do it.
In a world with FFS, we don't always get to have that model. Oh, FFS!

> Anything else is simply broken. It would be broken even if firmware
> wasn't involved, but obviously firmware people tend to often make a
> bad situation even worse.

Linus, be nice to firmware people. I've met a few, and I can vouch that 
they're very kind and nice. They're also very scared, especially when OS 
people want to ask them a few questions.

I think FFS should get out of the way when the OS advertises it's capable 
of handling XYZ. There are some good arguments why this hasn't happened, 
but I won't get into details. I do think it's unlikely that machines 
will be moving back to an OS-controlled model.

And Linus, keep in mind, when these machines were developed, OSes 
couldn't handle recovery properly. None of this was ever an issue. It's 
our fault that we've changed the OS after the machines are on the market.

Alex