Message-ID: <CAMciSVXDS_n7-XzHevMmAOhb-qCNsCBbE1Pym-zWybnOyjZWmw@mail.gmail.com>
Date: Mon, 24 Feb 2025 17:45:35 +0530
From: Naveen Kumar P <naveenkumar.parna@...il.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
kernelnewbies <kernelnewbies@...nelnewbies.org>, linux-acpi@...r.kernel.org
Subject: Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
>
> [+cc linux-acpi]
>
> On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > Hi all,
> >
> > I am writing to seek assistance with an issue we are experiencing with
> > a PCIe device (PLDA Device 5555) connected through PCI Express Root
> > Port 1 to the host bridge.
> >
> > We have observed that after booting the system, the Base Address
> > Register (BAR0) memory of this device gets reset to 0x0 after
> > approximately one hour or more (the timing is inconsistent). This was
> > verified using the lspci output and the setpci -s 01:00.0
> > BASE_ADDRESS_0 command.
> >
> > To diagnose the issue, we checked the dmesg log, but it did not
> > provide any relevant information. I then enabled dynamic debugging for
> > the PCI subsystem (drivers/pci/*) and noticed the following messages
> > related ACPI hotplug in the dmesg log:
> >
> > [ 0.465144] pci 0000:01:00.0: reg 0x10: [mem 0xb0400000-0xb07fffff]
> > ...
> > [ 6710.000355] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > [ 7916.250868] perf: interrupt took too long (4072 > 3601), lowering
> > kernel.perf_event_max_sample_rate to 49000
> > [ 7984.719647] perf: interrupt took too long (5378 > 5090), lowering
> > kernel.perf_event_max_sample_rate to 37000
> > [11051.409115] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > [11755.388727] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > [12223.885715] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > [14303.465636] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > After these messages appear, reading the device BAR memory results in
> > 0x0 instead of the expected value.
> >
> > I would like to understand the following:
> >
> > 1. What could be causing these hotplug_event debug messages?
>
> This is an ACPI Notify event. Basically the platform is telling us to
> re-enumerate the hierarchy below RP01 because a device might have been
> added or removed.
Thank you for your response regarding the PCI BAR reset issue we are
experiencing with the PLDA Device 5555. I have a few follow-up
questions and additional information to share.
1. Clarification on "Platform":
Does the term "platform" refer to the BIOS/ACPI subsystem in this context?
Can the platform request re-enumeration of the hierarchy below RP01
without a device actually being removed or added? In our case, the PCI
PLDA device is never physically removed from, or hot-added to, the bus.
2. System Configuration:
We are currently using an x86_64 system with Ubuntu 20.04.6 LTS
(kernel version: 5.4.0-148-generic).
I have enabled dynamic debug logs for all files in the PCI and ACPI
subsystems and rebooted the system with the following parameters:
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.4.0-148-generic root=/dev/mapper/vg00-rootvol ro
quiet libata.force=noncq pci=nomsi pcie_aspm=off pcie_ports=on
"dyndbg=file drivers/pci/* +p; file drivers/acpi/* +p"
3. Observations:
After rebooting with the additional debug logging enabled, the issue
appeared after about 1 day and 11:48 hours of uptime.
A snippet of the dmesg log is shown below (the complete dmesg log is
attached to this email):
[128845.248503] ACPI: GPE event 0x01
[128845.356866] ACPI: \_SB_.PCI0.RP01: ACPI_NOTIFY_BUS_CHECK event
[128845.357343] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
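
For cross-checking the uptime against the log, the dmesg timestamps
(seconds since boot) can be converted mechanically. The following is
only a minimal sketch; the function name bus_check_uptimes is
hypothetical, and it assumes the default "[seconds.microseconds]" dmesg
prefix with each message on a single line:

```python
import re
from datetime import timedelta

# Matches the default dmesg prefix, e.g. "[128845.357343] message"
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

def bus_check_uptimes(dmesg_text):
    """Return the uptime (as timedelta) of every acpiphp bus-check message."""
    uptimes = []
    for line in dmesg_text.splitlines():
        m = LINE.match(line.strip())
        if m and "Bus check in" in m.group(2):
            # Drop the fractional part; whole seconds are enough here.
            uptimes.append(timedelta(seconds=int(float(m.group(1)))))
    return uptimes

log = r"[128845.357343] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()"
for t in bus_check_uptimes(log):
    print(t)  # 1 day, 11:47:25
```

The result is consistent, to within a minute, with the uptime quoted
above.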
4. BAR Reset Issue:
To read BASE_ADDRESS_0, I filtered the lspci output down to the row of
configuration space that starts at offset 0x10, by running
sudo lspci -xxx -s 01:00.0 | grep "10:".
Prior to the BAR reset issue, the lspci output was:
$ sudo lspci -xxx -s 01:00.0 | grep "10:"
10: 00 00 40 b0 00 00 00 00 00 00 00 00 00 00 00 00
During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially
showed all FF's, and then the next run of the same command showed
BASE_ADDRESS_0 reset to zero:
$ sudo lspci -xxx -s 01:00.0 | grep "10:"
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
$ sudo lspci -xxx -s 01:00.0 | grep "10:"
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
I am not sure why lspci initially showed all FF's and then the next
run showed BAR0 reset.
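
For reference, the "10:" row can be decoded by hand: configuration
space is little-endian, so the first four bytes "00 00 40 b0" read as
0xb0400000, which matches the "reg 0x10" line in the boot log. Below is
a minimal sketch of that decoding; the helper names decode_bar0 and
classify are hypothetical. The interpretation of the all-ones value is
an assumption based on general PCI behaviour (a config read that gets
no response from the device usually returns all ones), not something
confirmed for this device:

```python
def decode_bar0(dump_row):
    """Return BAR0 as an int from the '10:' row of an `lspci -xxx` dump."""
    offset, _, hex_bytes = dump_row.partition(":")
    if int(offset, 16) != 0x10:
        raise ValueError("expected the row that starts at offset 0x10")
    raw = bytes.fromhex("".join(hex_bytes.split()))
    # BAR0 occupies the first four bytes of this row, least significant first.
    return int.from_bytes(raw[:4], "little")

def classify(bar0):
    """Rough, assumed interpretation of a raw BAR0 value."""
    if bar0 == 0xFFFFFFFF:
        return "all ones (config read probably got no response)"
    if bar0 == 0:
        return "zero (BAR not programmed)"
    return "assigned"

rows = [
    "10: 00 00 40 b0 00 00 00 00 00 00 00 00 00 00 00 00",
    "10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff",
    "10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
]
for row in rows:
    value = decode_bar0(row)
    print(f"{value:#010x} {classify(value)}")
```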
Complete sudo lspci -xxx -s 01:00.0 output is captured in the attached
dmesg_log_pci_bar_reset.txt file.
GPE counters read from /sys/firmware/acpi/interrupts:
/sys/firmware/acpi/interrupts/gpe01: 1 EN enabled unmasked
/sys/firmware/acpi/interrupts/gpe02: 1 EN enabled unmasked
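
Since the failure time is unpredictable, one way to pin down the moment
BAR0 changes and relate it to the GPE count would be to poll both from
user space. The following is only a sketch under stated assumptions:
the function names are hypothetical, the device is assumed to sit in
PCI domain 0000 (as in the boot log), and the first field of the gpe01
file is assumed to be the event count. Note that each poll issues a
config read to the device:

```python
import struct
import time

CONFIG = "/sys/bus/pci/devices/0000:01:00.0/config"  # assumes domain 0000
GPE01 = "/sys/firmware/acpi/interrupts/gpe01"

def read_bar0(config_path=CONFIG):
    """Read the 32-bit little-endian BAR0 at offset 0x10 of config space."""
    with open(config_path, "rb") as f:
        f.seek(0x10)
        return struct.unpack("<I", f.read(4))[0]

def read_gpe_count(gpe_path=GPE01):
    """Return the leading event count from a GPE counter file."""
    with open(gpe_path) as f:
        return int(f.read().split()[0])

def watch(interval_s=1.0, config_path=CONFIG, gpe_path=GPE01):
    """Print a timestamped line whenever BAR0 or the GPE count changes."""
    last = None
    while True:
        state = (read_bar0(config_path), read_gpe_count(gpe_path))
        if state != last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} BAR0={state[0]:#010x} gpe01={state[1]}")
            last = state
        time.sleep(interval_s)
```

Running watch() in a terminal and leaving it until the problem recurs
would give wall-clock times that can be lined up with the dmesg
entries.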
5. Debugging Steps:
Instrumenting acpiphp_check_bridge() will indicate whether we are
enabling or disabling a slot (enable_slot() or disable_slot()). Based
on the dmesg log, there is only one ACPI_NOTIFY_BUS_CHECK event, and
it is most likely for disable_slot(). However, will instrumenting
acpiphp_check_bridge() explain why this is happening even though the
PCI PLDA device is never actually removed?
6. Reproduction and Additional Information:
We do not see any clear pattern or procedure to reproduce this issue.
Once the issue occurs, rebooting the machine resolves it, but it
reoccurs after an unpredictable time.
We have another identical hardware setup with an older kernel (Ubuntu
16.04.4 LTS, kernel version: 4.4.0-66-generic), and this issue has not
been observed so far on that machine.
Any additional pointers or suggestions on how to find the root cause
of this issue would be greatly appreciated.
Thank you for your assistance.
>
> Unfortunately the only real information we get is the ACPI device
> (RP01) and the notification value (ACPI_NOTIFY_BUS_CHECK).
>
> You could instrument acpiphp_check_bridge() to see what path we take.
> The main paths look like enable_slot() or disable_slot(), but those
> both include a pr_debug() than you apparently don't see.
>
> A remove followed by add would definitely reset the device, including
> its BARs. But you would normally see some messages related to
> enumerating a new device.
>
> If this doesn't help, try to reproduce the problem with a recent
> kernel, e.g., v6.13, and post the complete dmesg log.
>
> > 2. Why does this result in the BAR memory being reset?
> > 3. How can we resolve this issue?
> >
> > I have verified that the issue occurs even without loading the driver
> > for the PLDA Device 5555, so it does not appear to be related to the
> > device driver.
> >
> > Any help or guidance on debugging this issue would be greatly appreciated.
> >
> > Thank you for your assistance.
> >
> > Best regards,
> > Naveen
View attachment "dmesg_log_pci_bar_reset.txt" of type "text/plain" (82778 bytes)