linux-kernel - Re: [PATCH 2/2] PCI: brcmstb: Add panic/die handler to driver


Message-ID: <CA+-6iNzAUpMfP8z=zXbQsz=4=YMYgxdSbpDucchECieqpzAzwg@mail.gmail.com>
Date: Wed, 6 Aug 2025 15:16:07 -0400
From: Jim Quinlan <james.quinlan@...adcom.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: linux-pci@...r.kernel.org, Nicolas Saenz Julienne <nsaenz@...nel.org>, 
	Bjorn Helgaas <bhelgaas@...gle.com>, Lorenzo Pieralisi <lorenzo.pieralisi@....com>, 
	Cyril Brulebois <kibi@...ian.org>, bcm-kernel-feedback-list@...adcom.com, 
	jim2101024@...il.com, Florian Fainelli <florian.fainelli@...adcom.com>, 
	Lorenzo Pieralisi <lpieralisi@...nel.org>, Krzysztof Wilczyński <kwilczynski@...nel.org>, 
	Manivannan Sadhasivam <mani@...nel.org>, Rob Herring <robh@...nel.org>, 
	"moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE" <linux-rpi-kernel@...ts.infradead.org>, 
	"moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE" <linux-arm-kernel@...ts.infradead.org>, 
	open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] PCI: brcmstb: Add panic/die handler to driver

On Wed, Aug 6, 2025 at 2:50 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
>
> On Wed, Aug 06, 2025 at 02:38:12PM -0400, Jim Quinlan wrote:
> > On Wed, Aug 6, 2025 at 2:15 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > >
> > > On Fri, Jun 13, 2025 at 06:08:43PM -0400, Jim Quinlan wrote:
> > > > Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> > > > by default Broadcom's STB PCIe controller effects an abort.  Some SoCs --
> > > > 7216 and its descendants -- have new HW that identifies error details.
> > >
> > > What's the long term plan for this?  This abort is a huge problem that
> > > we're seeing across arm64 platforms.  Forcing a panic and reboot for
> > > every uncorrectable error is pretty hard to deal with.
> >
> > Are you referring to STB/CM systems, Rpi, or something else altogether?
>
> Just in general.  I saw this recently with a Nuvoton NPCM8xx PCIe
> controller.  I'm not an arm64 guy, but I've been told that these
> aborts are basically unrecoverable from a kernel perspective.  For
> some reason several PCIe controllers intended for arm64 seem to raise
> aborts on PCIe errors.  At the moment, that means we can't recover
> from errors like surprise unplugs and other things that *should* be
> recoverable (perhaps at the cost of resetting or disabling a PCIe
> device).

FWIW, our original RC controller was paired with MIPS, so it could be
that a number of non-x86 camps just went with the panic-y behavior.

I believe that the PCIe spec allows this rude behavior, or doesn't
specifically disallow it.  I also remember that there is an ARM
standard initiative for ARM-based systems that requires the PCIe
error-gets-0xffffffff behavior.  We obviously don't conform.  At any
rate, I will send an email now to the HW folks I know to remind them
that we need this behavior, at least as a configurable option.
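
For readers less familiar with the "error-gets-0xffffffff" convention
mentioned above, here is a minimal userspace sketch of how software can
react when a failed read completes with all-ones data instead of an
abort. The struct fake_dev type and the fake_readl() and read_status()
helpers are hypothetical names made up for this illustration; they are
not part of the brcmstb driver or of any kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCIe device. "present" models whether
 * the endpoint still answers reads (false after a surprise unplug). */
struct fake_dev {
	bool present;
	uint32_t reg;
};

/* Models a controller that completes a failed read with all-ones
 * data rather than raising an abort. */
static uint32_t fake_readl(const struct fake_dev *dev)
{
	return dev->present ? dev->reg : UINT32_MAX;
}

/* Returns true and stores the value when the read looks valid.
 * Returns false when the all-ones pattern suggests the device is
 * gone, so the caller can recover instead of the system panicking. */
static bool read_status(const struct fake_dev *dev, uint32_t *val)
{
	uint32_t v = fake_readl(dev);

	if (v == UINT32_MAX)
		return false;
	*val = v;
	return true;
}
```

The check is ambiguous for a register that can legitimately hold
0xffffffff, so real code usually confirms the failure by other means
before disabling or resetting the device.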

Regards,
Jim Quinlan
Broadcom STB/CM
>
> > > Is there a plan to someday recover from these aborts?  Or change the
> > > hardware so it can at least be configured to return ~0 data after
> > > logging the error in the hardware registers?
> >
> > Some of our upcoming chips will have the ability to do nothing on
> > errant PCIe writes and return 0xffffffff on errant PCIe reads.   But
> > none of our STB/CM chips do this currently.   I've been asking for
> > this behavior for years but I have limited influence on what happens
> > in HW.
>
> Fingers crossed for either that or some other way to make these things
> recoverable.
>
> > > > This simple handler determines if the PCIe controller was the
> > > > cause of the abort and if so, prints out diagnostic info.
> > > > Unfortunately, an abort still occurs.
> > > >
> > > > Care is taken to read the error registers only when the PCIe
> > > > bridge is active and the PCIe registers are acceptable.
> > > > Otherwise, a "die" event caused by something other than the PCIe
> > > > could cause an abort if the PCIe "die" handler tried to access
> > > > registers when the bridge is off.
> > >
> > > Checking whether the bridge is active is a "mostly-works"
> > > situation since it's always racy.
> >
> > I'm not sure I understand the "racy" comment.  If the PCIe bridge is
> > off, we do not read the PCIe error registers.  In this case, PCIe is
> > probably not the cause of the panic.   In the rare case the PCIe
> > bridge is off  and it was the PCIe that caused the panic, nothing
> > gets reported, and this is where we are without this commit.
> > Perhaps this is what you mean by "mostly-works".  But this is the
> > best that can be done with SW given our HW.
>
> Right, my fault.  The error report registers don't look like standard
> PCIe things, so I suppose they are on the host side, not the PCIe
> side, so they're probably guaranteed to be accessible and non-racy
> unless the bridge is in reset.
>
> Bjorn
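
The guard discussed in this thread (read the error registers only when
the bridge is not in reset, and report only when the controller
actually logged an error) can be sketched in userspace C as follows.
This is an illustration under stated assumptions, not the code from the
patch: struct fake_pcie, its fields, enum report, and pcie_die_report()
are all hypothetical names.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of the controller state; the field names are
 * illustrative and are not taken from the brcmstb driver. */
struct fake_pcie {
	bool bridge_in_reset;	/* error registers must not be touched */
	uint32_t err_status;	/* nonzero: the controller logged an error */
	uint32_t err_addr;	/* address of the offending access */
};

enum report { REPORT_SKIPPED, REPORT_NOT_PCIE, REPORT_PCIE_ERROR };

/* Called from a (simulated) panic/die notifier. It only diagnoses;
 * it does not prevent or recover from the abort. */
static enum report pcie_die_report(const struct fake_pcie *pcie,
				   char *buf, size_t len)
{
	/* Touching the registers while the bridge is in reset could
	 * itself abort, so skip the report entirely in that case. */
	if (pcie->bridge_in_reset)
		return REPORT_SKIPPED;

	/* Nothing logged: something other than PCIe caused the die. */
	if (pcie->err_status == 0)
		return REPORT_NOT_PCIE;

	snprintf(buf, len, "PCIe error: status=0x%08" PRIx32
		 " addr=0x%08" PRIx32, pcie->err_status, pcie->err_addr);
	return REPORT_PCIE_ERROR;
}
```

When the bridge is in reset and PCIe was nevertheless the cause,
nothing is reported, which matches the situation without the handler.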
