linux-kernel - Re: [PATCH 2/2] PCI: brcmstb: Add panic/die handler to driver

Message-ID: <20250806185051.GA10150@bhelgaas>
Date: Wed, 6 Aug 2025 13:50:51 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: Jim Quinlan <james.quinlan@...adcom.com>
Cc: linux-pci@...r.kernel.org, Nicolas Saenz Julienne <nsaenz@...nel.org>,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Lorenzo Pieralisi <lorenzo.pieralisi@....com>,
	Cyril Brulebois <kibi@...ian.org>,
	bcm-kernel-feedback-list@...adcom.com, jim2101024@...il.com,
	Florian Fainelli <florian.fainelli@...adcom.com>,
	Lorenzo Pieralisi <lpieralisi@...nel.org>,
	Krzysztof Wilczyński <kwilczynski@...nel.org>,
	Manivannan Sadhasivam <mani@...nel.org>,
	Rob Herring <robh@...nel.org>,
	"moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE" <linux-rpi-kernel@...ts.infradead.org>,
	"moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE" <linux-arm-kernel@...ts.infradead.org>,
	open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] PCI: brcmstb: Add panic/die handler to driver

On Wed, Aug 06, 2025 at 02:38:12PM -0400, Jim Quinlan wrote:
> On Wed, Aug 6, 2025 at 2:15 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
> >
> > On Fri, Jun 13, 2025 at 06:08:43PM -0400, Jim Quinlan wrote:
> > > Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> > > by default Broadcom's STB PCIe controller effects an abort.  Some SoCs --
> > > 7216 and its descendants -- have new HW that identifies error details.
> >
> > What's the long term plan for this?  This abort is a huge problem that
> > we're seeing across arm64 platforms.  Forcing a panic and reboot for
> > every uncorrectable error is pretty hard to deal with.
> 
> Are you referring to STB/CM systems, Rpi, or something else altogether?

Just in general.  I saw this recently with a Nuvoton NPCM8xx PCIe
controller.  I'm not an arm64 guy, but I've been told that these
aborts are basically unrecoverable from a kernel perspective.  For
some reason several PCIe controllers intended for arm64 seem to raise
aborts on PCIe errors.  At the moment, that means we can't recover
from errors like surprise unplugs and other things that *should* be
recoverable (perhaps at the cost of resetting or disabling a PCIe
device).

> > Is there a plan to someday recover from these aborts?  Or change the
> > hardware so it can at least be configured to return ~0 data after
> > logging the error in the hardware registers?
> 
> Some of our upcoming chips will have the ability to do nothing on
> errant PCIe writes and return 0xffffffff on errant PCIe reads.  But
> none of our STB/CM chips do this currently.  I've been asking for
> this behavior for years but I have limited influence on what happens
> in HW.

Fingers crossed for either that or some other way to make these things
recoverable.

> > > This simple handler determines if the PCIe controller was the
> > > cause of the abort and if so, prints out diagnostic info.
> > > Unfortunately, an abort still occurs.
> > >
> > > Care is taken to read the error registers only when the PCIe
> > > bridge is active and the PCIe registers are accessible.
> > > Otherwise, a "die" event caused by something other than the PCIe
> > > could cause an abort if the PCIe "die" handler tried to access
> > > registers when the bridge is off.
> >
> > Checking whether the bridge is active is a "mostly-works"
> > situation since it's always racy.
> 
> I'm not sure I understand the "racy" comment.  If the PCIe bridge is
> off, we do not read the PCIe error registers.  In this case, PCIe is
> probably not the cause of the panic.  In the rare case the PCIe
> bridge is off and it was the PCIe that caused the panic, nothing
> gets reported, and this is where we are without this commit.
> Perhaps this is what you mean by "mostly-works".  But this is the
> best that can be done with SW given our HW.

Right, my fault.  The error report registers don't look like standard
PCIe things, so I suppose they are on the host side, not the PCIe
side, so they're probably guaranteed to be accessible and non-racy
unless the bridge is in reset.
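
The guard discussed above can be sketched in userspace C as follows.
This is not the code from the patch; the structure, field, and
function names (sim_pcie, sim_read_err, sim_die_handler) are all
hypothetical, and the real handler would presumably be registered
through the kernel's panic/die notifier chains, which the sketch
leaves out.  It only shows the ordering: check the bridge state first,
and touch the error registers only when that is safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of the controller state relevant to the handler. */
struct sim_pcie {
	bool bridge_in_reset;		/* error registers must not be touched */
	bool err_valid;			/* controller latched an error */
	uint32_t err_addr;		/* address of the offending access */
	unsigned int err_reg_reads;	/* counts error-register accesses */
};

/* Reading the error registers while the bridge is in reset would abort. */
static bool sim_read_err(struct sim_pcie *p, uint32_t *addr)
{
	assert(!p->bridge_in_reset);
	p->err_reg_reads++;
	*addr = p->err_addr;
	return p->err_valid;
}

/*
 * Called on a panic/die event.  Returns 1 and fills buf if the PCIe
 * controller logged an error, 0 if PCIe was not the cause or its
 * registers cannot be read safely.
 */
static int sim_die_handler(struct sim_pcie *p, char *buf, size_t len)
{
	uint32_t addr;

	if (len)
		buf[0] = '\0';
	if (p->bridge_in_reset)
		return 0;	/* nothing reported, same as having no handler */
	if (!sim_read_err(p, &addr))
		return 0;	/* some other subsystem caused the event */
	snprintf(buf, len, "PCIe abort at 0x%08x", (unsigned int)addr);
	return 1;
}
```

When the bridge is in reset the handler reports nothing, which matches
the "nothing gets reported" case Jim describes.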

Bjorn