linux-kernel - Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset

Message-ID: <20250224195423.GA473540@bhelgaas>
Date: Mon, 24 Feb 2025 13:54:23 -0600
From: Bjorn Helgaas <helgaas@...nel.org>
To: Naveen Kumar P <naveenkumar.parna@...il.com>
Cc: linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
	kernelnewbies <kernelnewbies@...nelnewbies.org>,
	linux-acpi@...r.kernel.org
Subject: Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset

On Tue, Feb 25, 2025 at 12:29:00AM +0530, Naveen Kumar P wrote:
> On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote:
> > > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > > > > Hi all,
> > > > >
> > > > > I am writing to seek assistance with an issue we are experiencing with
> > > > > a PCIe device (PLDA Device 5555) connected through PCI Express Root
> > > > > Port 1 to the host bridge.
> > > > >
> > > > > We have observed that after booting the system, the Base Address
> > > > > Register (BAR0) memory of this device gets reset to 0x0 after
> > > > > approximately one hour or more (the timing is inconsistent). This was
> > > > > verified using the lspci output and the setpci -s 01:00.0
> > > > > BASE_ADDRESS_0 command.

> ...
> I booted with the pcie_aspm=off kernel parameter, which means that
> PCIe Active State Power Management (ASPM) is disabled. Given this
> context, should I consider removing this setting to see if it affects
> the occurrence of the Bus Check notifications and the BAR0 reset
> issue?

Doesn't seem likely to be related.  Once configured, ASPM operates
without any software intervention.  But note that "pcie_aspm=off"
means the kernel doesn't touch ASPM configuration at all, and any
configuration done by firmware remains in effect.

You can tell whether ASPM has been enabled by firmware with "sudo
lspci -vv" before the problem occurs.
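
[Editor's sketch, not part of the original message.] The check described
above can be scripted by scanning saved "lspci -vv" output for the LnkCtl
line, which reports the ASPM state currently programmed on the link. The
function name and the sample text are illustrative assumptions; real
output should be captured on the affected machine before the problem
occurs.

```python
import re

def aspm_states(lspci_vv_text):
    """Return the ASPM field of every LnkCtl line in `lspci -vv` output."""
    # lspci prints e.g. "LnkCtl: ASPM Disabled; ..." or "LnkCtl: ASPM L1 Enabled; ..."
    return re.findall(r"LnkCtl:\s*ASPM ([^;]+);", lspci_vv_text)

# Illustrative sample only; not taken from the reporter's system.
SAMPLE = (
    "\t\tLnkCap:\tPort #0, Speed 5GT/s, Width x1, ASPM L0s L1\n"
    "\t\tLnkCtl:\tASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+\n"
)

print(aspm_states(SAMPLE))  # ['L1 Enabled']
```

A result of "Disabled" on every link would mean firmware left ASPM off; any
other value means it is active even with "pcie_aspm=off".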

> > > During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially
> > > showed all FF's, and then the next run of the same command showed
> > > BASE_ADDRESS_0 reset to zero:
> > > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> >
> > Looks like the device isn't responding at all here.  Could happen if
> > the device is reset or powered down.
>
> From the kernel driver or user space tools, is it possible to
> determine whether the device has been reset or powered down?  Are
> there any power management settings or configurations that could be
> causing the device to reset or power down unexpectedly?

Not really.  By "powered down", I meant D3cold, where the main power
is removed.  Config space is readable in all other power states.
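
[Editor's sketch, not part of the original message.] The two states seen
in this thread (all FF's, then BAR0 reading zero) can be told apart from a
raw config-space dump. The helper names and return strings below are
hypothetical, and the sysfs path assumes PCI domain 0000. An all-ones read
only shows that the device did not respond; as noted above, it cannot by
itself say whether the cause was a reset or a power-down.

```python
import struct

def classify_config(cfg: bytes) -> str:
    """Classify the first bytes of a PCI config-space dump (hypothetical helper)."""
    if len(cfg) < 0x14:
        raise ValueError("need at least 0x14 bytes of config space")
    (vendor,) = struct.unpack_from("<H", cfg, 0x00)  # Vendor ID
    (bar0,) = struct.unpack_from("<I", cfg, 0x10)    # BAR0
    if vendor == 0xFFFF:
        return "no-response"        # reads returned all ones
    if bar0 & 0xFFFFFFF0 == 0:      # ignore memory BAR flag bits
        return "bar0-unassigned"
    return "ok"

def read_config(bdf="0000:01:00.0"):
    """Read the standard header from sysfs (domain 0000 is an assumption)."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read(64)

print(classify_config(b"\xff" * 64))  # no-response
```

Logging classify_config(read_config()) periodically with a timestamp would
show when the device first stops responding relative to the Bus Check
notification.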

> > What is this device?  What driver is bound to it?  I don't see
> > anything in dmesg that identifies a driver.
>
> The PCIe device in question is a Xilinx FPGA endpoint, which is
> flashed with RTL code to expose several host interfaces to the system
> via the PCIe link.
>
> We have an out-of-tree driver for this device, but to eliminate the
> driver's role in this issue, I renamed the driver to prevent it from
> loading automatically after rebooting the machine. Despite not using
> the driver, the issue still occurred.

Oh, right, I forgot that you mentioned this before.

> > You're seeing the problem on v5.4 (Nov 2019), which is much newer than
> > v4.4 (Jan 2016).  But v5.4 is still really too old to spend a lot of
> > time on unless the problem still happens on a current kernel.

This part is important.  We don't want to spend a lot of time
debugging an issue that may have already been fixed upstream.

Bjorn