linux-kernel - Re: [PATCH 2/2] PCI: brcmstb: Add panic/die handler to driver


Message-ID: <8ff6c436-74bc-43f0-b5a6-3085ded52d02@broadcom.com>
Date: Thu, 7 Aug 2025 10:00:43 -0700
From: Florian Fainelli <florian.fainelli@...adcom.com>
To: Manivannan Sadhasivam <mani@...nel.org>
Cc: Bjorn Helgaas <helgaas@...nel.org>,
 Jim Quinlan <james.quinlan@...adcom.com>, linux-pci@...r.kernel.org,
 Nicolas Saenz Julienne <nsaenz@...nel.org>,
 Bjorn Helgaas <bhelgaas@...gle.com>,
 Lorenzo Pieralisi <lorenzo.pieralisi@....com>,
 Cyril Brulebois <kibi@...ian.org>, bcm-kernel-feedback-list@...adcom.com,
 jim2101024@...il.com, Lorenzo Pieralisi <lpieralisi@...nel.org>,
 Krzysztof Wilczyński <kwilczynski@...nel.org>,
 Rob Herring <robh@...nel.org>,
 "moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE"
 <linux-rpi-kernel@...ts.infradead.org>,
 "moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE"
 <linux-arm-kernel@...ts.infradead.org>,
 open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] PCI: brcmstb: Add panic/die handler to driver

On 8/6/25 22:26, Manivannan Sadhasivam wrote:
> On Wed, Aug 06, 2025 at 01:41:35PM GMT, Florian Fainelli wrote:
>> On 8/6/25 11:50, Bjorn Helgaas wrote:
>>>> I'm not sure I understand the "racy" comment.  If the PCIe bridge is
>>>> off, we do not read the PCIe error registers.  In this case, PCIe is
>>>> probably not the cause of the panic.   In the rare case the PCIe
>>>> bridge is off  and it was the PCIe that caused the panic, nothing
>>>> gets reported, and this is where we are without this commit.
>>>> Perhaps this is what you mean by "mostly-works".  But this is the
>>>> best that can be done with SW given our HW.
>>>
>>> Right, my fault.  The error report registers don't look like standard
>>> PCIe things, so I suppose they are on the host side, not the PCIe
>>> side, so they're probably guaranteed to be accessible and non-racy
>>> unless the bridge is in reset.
>>
>> To expand upon that part: in the situation I ran into, we had the PCIe link
>> down and had therefore clock gated the PCIe root complex hardware to conserve
>> power. Eventually I hit a voluntary panic, and since all registered panic
>> notifiers are invoked in succession, the one registered for the PCIe RC was
>> invoked as well. Accessing clock gated registers does not work and triggers
>> another fault, which is confusing and mingles with the panic I was trying to
>> debug in the first place. Hence this check; a clock gated PCIe RC would not
>> be logging any errors anyway.
> 
> May I ask how you are recovering from link down? Can the driver detect link down
> using any platform IRQ?

Just to be clear, what I was describing here is not a link down
recovery. The point I was trying to convey is that we have multiple
buses in our system (DRAM, on-chip registers, PCIe), and each one of
them has its own way of reporting errors, so when we hit a system
error or kernel panic we like to interrogate each of them to figure out
the cause. In the case I was describing, I was actually tracking down a
bad DRAM access, but the error report came from the on-chip register
arbiter, because the panic handler had first tried to read from the
clock gated PCIe bridge to check whether it was responsible for the bad
access. That points you to the wrong source of the bad access, which is
why we guard the panic handler invocation within the PCIe root
complex with a check whether the bridge is in reset or not.
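
A minimal user-space sketch of the guard described above, not the actual
brcmstb driver code. Every name here (struct fake_pcie,
fake_read_err_status(), fake_pcie_panic_handler(), the HANDLER_* values)
is hypothetical and made up for illustration. In the kernel this logic
would sit in a panic/die notifier callback and read real hardware
registers; here the "register" is just a struct field, so the example
can be compiled and run outside the kernel.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the PCIe root complex state.
 * This is NOT the real driver structure.
 */
struct fake_pcie {
	bool bridge_in_reset;    /* true while the bridge is in reset / clock gated */
	unsigned int err_status; /* pretend error-status register contents */
	unsigned int reg_reads;  /* number of simulated register accesses */
};

/* Possible outcomes of the handler (hypothetical values). */
enum {
	HANDLER_SKIPPED = 0,      /* bridge in reset, registers not touched */
	HANDLER_NO_ERROR = 1,     /* registers read, nothing to report */
	HANDLER_ERROR_LOGGED = 2, /* registers read, error reported */
};

/*
 * Simulated register read. On real hardware, doing this while the block
 * is clock gated is what would raise the second, misleading fault.
 */
static unsigned int fake_read_err_status(struct fake_pcie *p)
{
	p->reg_reads++;
	return p->err_status;
}

/*
 * Panic-time handler: only interrogate the error registers when the
 * bridge is out of reset. A gated bridge cannot have logged anything,
 * so skipping it loses no information.
 */
static int fake_pcie_panic_handler(struct fake_pcie *p)
{
	unsigned int status;

	if (p->bridge_in_reset)
		return HANDLER_SKIPPED;

	status = fake_read_err_status(p);
	if (!status)
		return HANDLER_NO_ERROR;

	fprintf(stderr, "PCIe error status: 0x%08x\n", status);
	return HANDLER_ERROR_LOGGED;
}
```

Calling fake_pcie_panic_handler() on a state with bridge_in_reset set
returns HANDLER_SKIPPED and leaves reg_reads at 0, so the handler adds
no fault of its own to the panic being debugged.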

If this is still not clear, let me know.
-- 
Florian