linux-kernel - Re: [PATCH 0/2] PCI: Workaround for bus reset on Cavium cn8xxx root ports

Message-ID: <20170523212000.GA20928@bhelgaas-glaptop.roam.corp.google.com>
Date:   Tue, 23 May 2017 16:20:00 -0500
From:   Bjorn Helgaas <helgaas@...nel.org>
To:     Alex Williamson <alex.williamson@...hat.com>
Cc:     David Daney <david.daney@...ium.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>, linux-pci@...r.kernel.org,
        Jon Masters <jcm@...hat.com>,
        Robert Richter <robert.richter@...ium.com>,
        linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH 0/2] PCI: Workaround for bus reset on Cavium cn8xxx root
 ports

On Tue, May 23, 2017 at 03:04:04PM -0600, Alex Williamson wrote:
> On Tue, 23 May 2017 15:47:50 -0500
> Bjorn Helgaas <helgaas@...nel.org> wrote:
> 
> > On Mon, May 15, 2017 at 05:17:34PM -0700, David Daney wrote:
> > > With the recent improvements in arm64 and vfio-pci, we are seeing
> > > failures like this (on cn8890 based systems):
> > > 
> > > [  235.622361] Unhandled fault: synchronous external abort (0x96000210) at 0xfffffc00c1000100
> > > [  235.630625] Internal error: : 96000210 [#1] PREEMPT SMP
> > > .
> > > .
> > > .
> > > [  236.208820] [<fffffc0008411250>] pci_generic_config_read+0x38/0x9c
> > > [  236.214992] [<fffffc0008435ed4>] thunder_pem_config_read+0x54/0x1e8
> > > [  236.221250] [<fffffc0008411620>] pci_bus_read_config_dword+0x74/0xa0
> > > [  236.227596] [<fffffc000841853c>] pci_find_next_ext_capability.part.15+0x40/0xb8
> > > [  236.234896] [<fffffc0008419428>] pci_find_ext_capability+0x20/0x30
> > > [  236.241068] [<fffffc0008423e2c>] pci_restore_vc_state+0x34/0x88
> > > [  236.246979] [<fffffc000841af3c>] pci_restore_state.part.37+0x2c/0x1fc
> > > [  236.253410] [<fffffc000841b174>] pci_dev_restore+0x4c/0x50
> > > [  236.258887] [<fffffc000841b19c>] pci_bus_restore+0x24/0x4c
> > > [  236.264362] [<fffffc000841c2dc>] pci_try_reset_bus+0x7c/0xa0
> > > [  236.270021] [<fffffc00060a1ab0>] vfio_pci_ioctl+0xc34/0xc3c [vfio_pci]
> > > [  236.276547] [<fffffc0005eb0410>] vfio_device_fops_unl_ioctl+0x20/0x30 [vfio]
> > > [  236.283587] [<fffffc000824b314>] do_vfs_ioctl+0xac/0x744
> > > [  236.288890] [<fffffc000824ba30>] SyS_ioctl+0x84/0x98
> > > [  236.293846] [<fffffc0008082ca0>] __sys_trace_return+0x0/0x4
> > > 
> > > These are caused by the inability of the PCIe root port and Intel
> > > e1000e to successfully do a bus reset.
> > > 
> > > The proposed fix is to not do a bus reset on these systems.
> > > 
> > > David Daney (2):
> > >   PCI: Allow PCI_DEV_FLAGS_NO_BUS_RESET to be used on bus device.
> > >   PCI: Avoid bus reset for Cavium cn8xxx root ports.
> > > 
> > >  drivers/pci/pci.c    | 4 ++++
> > >  drivers/pci/quirks.c | 8 ++++++++
> > >  2 files changed, 12 insertions(+)  
> > 
> > Applied with Eric's reviewed-by and typo fixes to pci/virtualization for
> > v4.13, thanks!
> 
> Hmm, well let me again express my concerns that I'm really not sure how
> to support this since it removes our last opportunity to reset devices
> that may otherwise have no reset mechanism.  Certain classes of devices
> are entirely unsupportable for the code path indicated above without a
> bus reset.  If we have an endpoint device that goes bonkers at a bus
> reset, at least we know it's going to behave just as poorly no matter
> what the host platform.  This series allows endpoints that work
> perfectly well on one host to be handled differently on another.  It
> certainly suggests something non-spec compliant about the root port
> implementation and I wish there was more analysis about exactly what
> that problem is since this is coming from the hardware vendor.
> 
> https://lkml.org/lkml/2017/5/16/662

I almost poked you about this on IRC; guess I should have :)

Is it better to leave it as-is, and just take the aborts David
reported?

I agree, it would be nice to know what's really going on.  I assume
Cavium is interested in that as well to make sure future parts don't
have the issue.

Bjorn
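
For readers following the thread, the approach in David's series can be
modeled roughly as below. This is a user-space sketch only, not the kernel
code: every model_* name is hypothetical, and the one assumption taken from
the patch titles is that a "no bus reset" flag on the bridge device that
owns a bus (here, the root port) is enough to refuse the bus reset, in
addition to the same flag on any device on the bus. The -ENOTTY return
value is also an assumption about how "reset not supported" is reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI_DEV_FLAGS_NO_BUS_RESET-style flag. */
#define MODEL_NO_BUS_RESET (1u << 0)

struct model_dev {
	unsigned int flags;
};

struct model_bus {
	struct model_dev *self;      /* bridge (e.g. root port) leading to this bus */
	struct model_dev **devices;  /* devices sitting on this bus */
	size_t ndev;
};

/* Quirk hook: mark a device as unable to survive a bus reset. */
static void model_quirk_no_bus_reset(struct model_dev *dev)
{
	assert(dev != NULL);
	dev->flags |= MODEL_NO_BUS_RESET;
}

/* A bus is resettable only if neither its bridge nor any device on it
 * carries the flag. Checking bus->self models the first patch in the
 * series (assumed from its title). */
static bool model_bus_resettable(const struct model_bus *bus)
{
	assert(bus != NULL);
	if (bus->self && (bus->self->flags & MODEL_NO_BUS_RESET))
		return false;
	for (size_t i = 0; i < bus->ndev; i++)
		if (bus->devices[i]->flags & MODEL_NO_BUS_RESET)
			return false;
	return true;
}

/* Returns 0 if a reset would be attempted, -ENOTTY if it is refused. */
static int model_try_reset_bus(const struct model_bus *bus)
{
	if (!model_bus_resettable(bus))
		return -ENOTTY;
	return 0; /* a real implementation would save state, reset, restore */
}
```

In this model, once the quirk has run on the root port, every reset attempt
on the bus below it is refused regardless of which endpoint is plugged in.
That is the behavior Alex is concerned about: an endpoint that resets fine
on another host loses its bus reset here.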