linux-kernel - Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

Message-ID: <20220313214314.GD182809@otc-nc-03>
Date:   Sun, 13 Mar 2022 14:43:14 -0700
From:   "Raj, Ashok" <ashok.raj@...el.com>
To:     Bjorn Helgaas <helgaas@...nel.org>
Cc:     Kuppuswamy Sathyanarayanan 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        Russell Currey <ruscur@...sell.cc>,
        Oliver OHalloran <oohall@...il.com>,
        Kuppuswamy Sathyanarayanan <knsathya@...nel.org>,
        linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
        Eric Badger <ebadger@...estorage.com>,
        linuxppc-dev@...ts.ozlabs.org, Ashok Raj <ashok.raj@...el.com>
Subject: Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable
 errors properly

On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > Currently the aer_irq() handler returns IRQ_NONE when neither
> > PCI_ERR_ROOT_UNCOR_RCV nor PCI_ERR_ROOT_COR_RCV is set. But this
> > assumption is incorrect.
> > 
> > Consider a scenario where aer_irq() is triggered for a correctable
> > error, and while we process the error and before we clear the error
> > status in "Root Error Status" register, if the same kind of error
> > is triggered again, since aer_irq() only clears events it saw, the
> > multi-bit error is left intact. This will cause the interrupt to fire
> > again, resulting in entering aer_irq() with just the multi-bit error
> > logged in the "Root Error Status" register.
> > 
> > Repeated AER recovery testing has revealed that this condition does
> > happen, and that it prevents any new interrupt from being triggered.
> > Allow the interrupt to be processed even if only the multi-correctable
> > (BIT 1) or multi-uncorrectable (BIT 3) bit is set.
> > 
> > Reported-by: Eric Badger <ebadger@...estorage.com>
> 
> Is there a bug report with any concrete details (dmesg, lspci, etc)
> that we can include here?

Eric might have more details to add; he collected numerous logs to
establish the timeline of the problem. The test stressed the links with an
automated power off, which results in an eDPC UC error followed by link
down. Recovery worked fine for several cycles, and then suddenly there were
no more interrupts. After a manual PCI rescan the device is probed and is
operational again.

The test patch revealed that we entered aer_irq() with only a multi-error
bit set (PCI_ERR_ROOT_MULTI_COR_RCV or PCI_ERR_ROOT_MULTI_UNCOR_RCV).
Because we returned without clearing those bits, interrupt generation
ceased after that.
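
To make the sequence concrete, here is a rough user-space sketch of the
failure mode. It is not the code in drivers/pci/pcie/aer.c: the struct,
the sketch_* names and the two masks are made up for illustration, and
the register is modeled as a plain variable with write-1-to-clear
behavior. Only the four PCI_ERR_ROOT_* values are taken from
include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Root Error Status bits, values as in include/uapi/linux/pci_regs.h */
#define PCI_ERR_ROOT_COR_RCV         0x00000001u
#define PCI_ERR_ROOT_MULTI_COR_RCV   0x00000002u /* BIT 1 */
#define PCI_ERR_ROOT_UNCOR_RCV       0x00000004u
#define PCI_ERR_ROOT_MULTI_UNCOR_RCV 0x00000008u /* BIT 3 */

/* Hypothetical masks, named only for this sketch */
#define SKETCH_FIRST_ERR_MASK \
	(PCI_ERR_ROOT_COR_RCV | PCI_ERR_ROOT_UNCOR_RCV)
#define SKETCH_ANY_ERR_MASK \
	(SKETCH_FIRST_ERR_MASK | PCI_ERR_ROOT_MULTI_COR_RCV | \
	 PCI_ERR_ROOT_MULTI_UNCOR_RCV)

/* Hypothetical stand-in for a root port's Root Error Status register */
struct sketch_root_port {
	uint32_t root_status;
};

/*
 * Simplified interrupt handler: returns false (think IRQ_NONE) when none
 * of the bits in handled_mask are set, otherwise clears every bit it saw.
 */
bool sketch_irq(struct sketch_root_port *rp, uint32_t handled_mask)
{
	uint32_t seen = rp->root_status;

	if (!(seen & handled_mask))
		return false;		/* nothing is cleared on this path */

	rp->root_status &= ~seen;	/* models write-1-to-clear */
	return true;
}

/* Walks through the race described above. */
void sketch_demo(void)
{
	struct sketch_root_port rp = { .root_status = PCI_ERR_ROOT_COR_RCV };
	uint32_t seen;

	/* First interrupt: handler reads the status ... */
	seen = rp.root_status;
	/* ... a second correctable error arrives before the clear ... */
	rp.root_status |= PCI_ERR_ROOT_MULTI_COR_RCV;
	/* ... and the handler clears only what it saw. */
	rp.root_status &= ~seen;
	assert(rp.root_status == PCI_ERR_ROOT_MULTI_COR_RCV);

	/* Checking only the first-error bits: bail out, bit stays set. */
	assert(!sketch_irq(&rp, SKETCH_FIRST_ERR_MASK));
	assert(rp.root_status == PCI_ERR_ROOT_MULTI_COR_RCV);

	/* Checking the multi-error bits too: handled and cleared. */
	assert(sketch_irq(&rp, SKETCH_ANY_ERR_MASK));
	assert(rp.root_status == 0);
}
```

Calling sketch_demo() runs through both cases. With the narrower mask
the leftover multi-error bit is never cleared, which matches what the
test patch showed.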

Cheers,
Ashok