linux-kernel - Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system


Message-ID: <20180416031726.GB158153@bhelgaas-glaptop.roam.corp.google.com>
Date:   Sun, 15 Apr 2018 22:17:26 -0500
From:   Bjorn Helgaas <helgaas@...nel.org>
To:     Sinan Kaya <okaya@...eaurora.org>
Cc:     Keith Busch <keith.busch@...el.com>,
        Oza Pawandeep <poza@...eaurora.org>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        Philippe Ombredanne <pombredanne@...b.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Kate Stewart <kstewart@...uxfoundation.org>,
        linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
        Dongdong Liu <liudongdong3@...wei.com>,
        Wei Zhang <wzhang@...com>, Timur Tabi <timur@...eaurora.org>,
        Alex Williamson <alex.williamson@...hat.com>
Subject: Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled
 system

On Sat, Apr 14, 2018 at 11:53:17AM -0400, Sinan Kaya wrote:

> You indicated that you want to unify the AER and DPC behavior. Let's
> settle on what we want to do one more time. We have been going forth
> and back on the direction.

My thinking is that as much as possible, similar events should be
handled similarly, whether the mechanism is AER, DPC, EEH, etc.
Ideally, drivers shouldn't have to be aware of which mechanism is in
use.

Error recovery includes conventional PCI as well, but right now I
think we're only concerned with PCIe.  The following error types are
from PCIe r4.0, sec 6.2.2:

  ERR_COR
    Corrected by hardware with no software intervention.  Software
    involved for logging only.

    Handled by AER via pci_error_handlers; DPC is never involved.

    Link is unaffected.

  ERR_NONFATAL
    A transaction is unreliable but the link is fully functional.

    If DPC is not supported, handled by AER via pci_error_handlers and
    the link is unaffected.

    If DPC is supported, handled by DPC (because we set
    PCI_EXP_DPC_CTL_EN_NONFATAL) via remove/re-enumerate.

  ERR_FATAL
    The link is unreliable.

    If DPC is not supported, handled by AER via pci_error_handlers and
    the link is reset.

    If DPC is supported, handled by DPC via remove/re-enumerate.
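
To make the asymmetry concrete, the current behavior listed above can be
written as a small decision function.  This is only a sketch of the list,
not kernel code: every name in it (err_type, mechanism, recovery,
current_handling) is made up for illustration, and it models only which
mechanism handles the error and which recovery path is taken.

```c
#include <stdbool.h>

/* All names below are hypothetical, chosen for illustration only. */
enum err_type  { ERR_COR, ERR_NONFATAL, ERR_FATAL };
enum mechanism { MECH_AER, MECH_DPC };
enum recovery  { REC_ERROR_HANDLERS, REC_REMOVE_REENUMERATE };

struct handling {
	enum mechanism mech;	/* who handles the error */
	enum recovery  rec;	/* pci_error_handlers vs. remove/re-enumerate */
};

/* Current behavior as described in the list above. */
static struct handling current_handling(enum err_type type, bool dpc_supported)
{
	/* Default: AER via pci_error_handlers. */
	struct handling h = { MECH_AER, REC_ERROR_HANDLERS };

	/* ERR_COR never involves DPC; the other two do if DPC is present. */
	if (type != ERR_COR && dpc_supported) {
		h.mech = MECH_DPC;
		h.rec = REC_REMOVE_REENUMERATE;
	}
	return h;
}
```

The point is that the result for ERR_NONFATAL and ERR_FATAL depends on
dpc_supported, which is what a driver would ideally not have to care about.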

It doesn't seem right to me that we handle both ERR_NONFATAL and
ERR_FATAL events differently if we happen to have DPC support in a
switch.

Maybe we should consider triggering DPC only on ERR_FATAL?  That would
keep DPC out of the ERR_NONFATAL cases.
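
Roughly, that would mean programming the DPC trigger enable field for
ERR_FATAL only.  A minimal sketch, assuming the trigger enable occupies
bits 1:0 of the DPC Control register (01b = ERR_FATAL only, 10b =
ERR_NONFATAL or ERR_FATAL); the DPC_CTL_* names and the helper are
invented here and are not the kernel's definitions, so check the values
against the spec and pci_regs.h before relying on them:

```c
#include <stdint.h>

/*
 * Hypothetical constants for illustration; assumed layout of the
 * DPC Control register trigger enable field (bits 1:0).
 */
#define DPC_CTL_TRIGGER_MASK      0x0003u
#define DPC_CTL_TRIGGER_FATAL     0x0001u	/* trigger on ERR_FATAL only */
#define DPC_CTL_TRIGGER_NONFATAL  0x0002u	/* ERR_NONFATAL or ERR_FATAL */

/* Return a control value that triggers DPC on ERR_FATAL only. */
static uint16_t dpc_ctl_fatal_only(uint16_t ctl)
{
	ctl &= (uint16_t)~DPC_CTL_TRIGGER_MASK;	/* clear the trigger field */
	ctl |= DPC_CTL_TRIGGER_FATAL;		/* leave other bits untouched */
	return ctl;
}
```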

For ERR_FATAL, maybe we should bite the bullet and use
remove/re-enumerate for AER as well as for DPC.  That would be painful
for higher-level software, but if we're willing to accept that pain
for new systems that support DPC, maybe life would be better overall
if it worked the same way on systems without DPC?
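
Put together, the two ideas above would look something like the sketch
below.  Again, the names are hypothetical and this is only meant to show
the shape of the proposal: the recovery path depends on the error type
alone, and DPC support only changes which mechanism reports ERR_FATAL.

```c
#include <stdbool.h>

/* All names below are hypothetical, chosen for illustration only. */
enum err_type  { ERR_COR, ERR_NONFATAL, ERR_FATAL };
enum mechanism { MECH_AER, MECH_DPC };
enum recovery  { REC_ERROR_HANDLERS, REC_REMOVE_REENUMERATE };

/* DPC triggers only on ERR_FATAL; everything else stays with AER. */
static enum mechanism proposed_mechanism(enum err_type type, bool dpc_supported)
{
	return (type == ERR_FATAL && dpc_supported) ? MECH_DPC : MECH_AER;
}

/* Recovery no longer takes dpc_supported: same path with or without DPC. */
static enum recovery proposed_recovery(enum err_type type)
{
	return type == ERR_FATAL ? REC_REMOVE_REENUMERATE : REC_ERROR_HANDLERS;
}
```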

Bjorn