Message-ID: <CAMciSVVu6qL6QV7KqLem2ZoRoW2T5a3s13EyKE-4SFGHDFfR4g@mail.gmail.com>
Date: Wed, 19 Mar 2025 20:07:55 +0530
From: Naveen Kumar P <naveenkumar.parna@...il.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
kernelnewbies <kernelnewbies@...nelnewbies.org>, linux-acpi@...r.kernel.org
Subject: Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
>
> On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote:
> > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > > > Hi all,
> > > >
> > > > I am writing to seek assistance with an issue we are experiencing with
> > > > a PCIe device (PLDA Device 5555) connected through PCI Express Root
> > > > Port 1 to the host bridge.
> > > >
> > > > We have observed that after booting the system, the Base Address
> > > > Register (BAR0) memory of this device gets reset to 0x0 after
> > > > approximately one hour or more (the timing is inconsistent). This was
> > > > verified using the lspci output and the setpci -s 01:00.0
> > > > BASE_ADDRESS_0 command.
> > > >
> > > > To diagnose the issue, we checked the dmesg log, but it did not
> > > > provide any relevant information. I then enabled dynamic debugging for
> > > > the PCI subsystem (drivers/pci/*) and noticed the following messages
> > > > related to ACPI hotplug in the dmesg log:
> > > >
> > > > [ 0.465144] pci 0000:01:00.0: reg 0x10: [mem 0xb0400000-0xb07fffff]
> > > > ...
> > > > [ 6710.000355] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [ 7916.250868] perf: interrupt took too long (4072 > 3601), lowering
> > > > kernel.perf_event_max_sample_rate to 49000
> > > > [ 7984.719647] perf: interrupt took too long (5378 > 5090), lowering
> > > > kernel.perf_event_max_sample_rate to 37000
> > > > [11051.409115] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [11755.388727] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [12223.885715] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [14303.465636] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > After these messages appear, reading the device BAR memory results in
> > > > 0x0 instead of the expected value.
> > > >
> > > > I would like to understand the following:
> > > >
> > > > 1. What could be causing these hotplug_event debug messages?
> > >
> > > This is an ACPI Notify event. Basically the platform is telling us to
> > > re-enumerate the hierarchy below RP01 because a device might have been
> > > added or removed.
> >
> > Thank you for your response regarding the PCI BAR reset issue we are
> > experiencing with the PLDA Device 5555. I have a few follow-up
> > questions and additional information to share.
> >
> > 1. Clarification on "Platform":
> >
> > Does the term "platform" refer to the BIOS/ACPI subsystem in this context?
>
> Yes, "platform" refers to the BIOS/ACPI subsystem.
>
> > Can the platform signal to re-enumerate the hierarchy below RP01
> > without an actual device being removed or added? In our case, the PCI
> > PLDA device is neither physically removed nor connected to the bus on
> > the fly.
>
> Yes, I think a Bus Check notification is just a request for the OS to
> re-enumerate starting at the point in the device tree where it is
> notified. It's possible that no add or remove has occurred. ACPI
> r6.5, sec 5.6.6, includes the example of hardware that can't detect
> device changes during a system sleep state, so it issues a Bus Check
> on wake.
>
> > 2. System Configuration:
> >
> > We are currently using an x86_64 system with Ubuntu 20.04.6 LTS
> > (kernel version: 5.4.0-148-generic).
> > I have enabled dynamic debug logs for all files in the PCI and ACPI
> > subsystems and rebooted the system with the following parameters:
> > $ cat /proc/cmdline
> > BOOT_IMAGE=/vmlinuz-5.4.0-148-generic root=/dev/mapper/vg00-rootvol ro
> > quiet libata.force=noncq pci=nomsi pcie_aspm=off pcie_ports=on
> > "dyndbg=file drivers/pci/* +p; file drivers/acpi/* +p"
> >
> >
> > 3. Observations:
> >
> > After rebooting with more debug logs, I noticed the issue after 1 day,
> > 11:48 hours.
> > A snippet of the dmesg log is mentioned below (complete dmesg log is
> > attached to this email):
> >
> > [128845.248503] ACPI: GPE event 0x01
> > [128845.356866] ACPI: \_SB_.PCI0.RP01: ACPI_NOTIFY_BUS_CHECK event
> > [128845.357343] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in
> > hotplug_event()
>
> If you could add more debug in hotplug_event() and the things it
> calls, we might get more clues about what's happening.
>
> > 4. BAR Reset Issue:
> >
> > I filtered the lspci output to show the contents of the configuration
> > space starting at offset 0x10 for getting BASE_ADDRESS_0 by running
> > sudo lspci -xxx -s 01:00.0 | grep "10:".
> > Prior to the BAR reset issue, the lspci output was:
> > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > 10: 00 00 40 b0 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially
> > showed all FF's, and then the next run of the same command showed
> > BASE_ADDRESS_0 reset to zero:
> > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>
> Looks like the device isn't responding at all here. Could happen if
> the device is reset or powered down.
I am reaching out to follow up on the PCI BAR0 reset issue and its
potential connection to the ACPI errors observed in my system running
Linux kernel 6.13.0+.
Observations:
At timestamp 77998s, repeated AE_AML_LOOP_TIMEOUT errors are present
in the kernel log for the \_SB.PCI0.SBUS.RDWD and
\_SB.PCI0.SBRG.ADP1._PSR methods.
Below are the relevant dmesg log entries:
[77998.038653]
Initialized Local Variables for Method [RDWD]:
[77998.038691] Local1: 00000000d0fe9fee <Obj> Integer
0000000000000015
[77998.038862] Local6: 00000000f41c1645 <Obj> Integer
0000000000000000
[77998.039031] Local7: 0000000040665c5c <Obj> Integer
000000000000AA55
[77998.039237] Initialized Arguments for Method [RDWD]: (2 arguments
defined for method invocation)
[77998.039273] Arg0: 00000000397dc53d <Obj> Integer
0000000000000014
[77998.039442] Arg1: 000000001099e334 <Obj> Integer
0000000000000002
[77998.039673] ACPI Error: Aborting method \_SB.PCI0.SBUS.RDWD due to
previous error (AE_AML_LOOP_TIMEOUT) (20240827/psparse-529)
[77998.040455] ACPI Error: Aborting method \_SB.PCI0.SBRG.ADP1._PSR
due to previous error (AE_AML_LOOP_TIMEOUT) (20240827/psparse-529)
[77998.043659] ACPI: \_SB_.PCI0.SBRG.ADP1: Error reading AC Adapter
state: AE_AML_LOOP_TIMEOUT
[78028.442980]
Initialized Local Variables for Method [RDWD]:
[78028.443019] Local1: 00000000898a9a7e <Obj> Integer
0000000000000017
[78028.443192] Local6: 000000000cf0c853 <Obj> Integer
0000000000000000
[78028.443362] Local7: 00000000bb4aa65f <Obj> Integer
000000000000AA55
[78028.443569] Initialized Arguments for Method [RDWD]: (2 arguments
defined for method invocation)
[78028.443606] Arg0: 00000000ef8d445b <Obj> Integer
0000000000000016
[78028.443775] Arg1: 000000006cacc887 <Obj> Integer
0000000000000003
[78028.444006] ACPI Error: Aborting method \_SB.PCI0.SBUS.RDWD due to
previous error (AE_AML_LOOP_TIMEOUT) (20240827/psparse-529)
[78028.446366] ACPI Error: Aborting method \_SB.PCI0.SBRG.ADP1._PSR
due to previous error (AE_AML_LOOP_TIMEOUT) (20240827/psparse-529)
[78028.451179] ACPI: \_SB_.PCI0.SBRG.ADP1: Error reading AC Adapter
state: AE_AML_LOOP_TIMEOUT
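To track how often these aborts occur across long uptimes, I am using a
small helper (a sketch only; the regex assumes the dmesg format shown
above) that pulls the timestamp and the aborted method name out of the
log:

```python
import re

# Matches lines like:
# [77998.039673] ACPI Error: Aborting method \_SB.PCI0.SBUS.RDWD due to ...
ABORT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ACPI Error: Aborting method (?P<method>\S+)"
)

def aborted_methods(log_lines):
    """Return (timestamp, method) tuples for every aborted AML method."""
    hits = []
    for line in log_lines:
        m = ABORT_RE.search(line)
        if m:
            hits.append((float(m.group("ts")), m.group("method")))
    return hits
```

Feeding it the output of dmesg lets me correlate the abort timestamps
with the later BUS_CHECK events.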
I noticed these ACPI errors and then immediately ran lspci at
timestamp 80269.706159 to check the status of the PCIe device (01:00.0
- PLDA Device 5555). All PCI config space reads returned 0xFFFFFFFF,
indicating the device was unresponsive.
Shortly after running lspci, an ACPI_NOTIFY_BUS_CHECK event was
triggered, and subsequently BAR0 was reset to zero.
# lspci
00:00.0 Host bridge: Intel Corporation Atom Processor Z36xxx/Z37xxx
Series SoC Transaction Register (rev 11)
00:02.0 VGA compatible controller: Intel Corporation Atom Processor
Z36xxx/Z37xxx Series Graphics & Display (rev 11)
00:13.0 SATA controller: Intel Corporation Atom Processor E3800 Series
SATA AHCI Controller (rev 11)
00:14.0 USB controller: Intel Corporation Atom Processor
Z36xxx/Z37xxx, Celeron N2000 Series USB xHCI (rev 11)
00:17.0 SD Host controller: Intel Corporation Atom Processor E3800
Series eMMC 4.5 Controller (rev 11)
00:1a.0 Encryption controller: Intel Corporation Atom Processor
Z36xxx/Z37xxx Series Trusted Execution Engine (rev 11)
00:1b.0 Audio device: Intel Corporation Atom Processor Z36xxx/Z37xxx
Series High Definition Audio Controller (rev 11)
00:1c.0 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI
Express Root Port 1 (rev 11)
00:1c.2 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI
Express Root Port 3 (rev 11)
00:1c.3 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI
Express Root Port 4 (rev 11)
00:1f.0 ISA bridge: Intel Corporation Atom Processor Z36xxx/Z37xxx
Series Power Control Unit (rev 11)
00:1f.3 SMBus: Intel Corporation Atom Processor E3800 Series SMBus
Controller (rev 11)
01:00.0 RAM memory: PLDA Device 5555 (rev ff)
03:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network
Connection (rev 03)
In this case, the PLDA Device 5555 shows rev ff - what does this indicate?
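My reading is that rev ff means the revision ID byte itself came back as
0xFF, i.e. the whole config space is reading as all 1s. I now check for
that condition with a small script (a sketch; the sysfs path is from my
setup and reading it needs root):

```python
def config_looks_dead(cfg: bytes) -> bool:
    """True if a PCI config-space read looks like an unresponsive device.

    A device that has dropped off the bus returns all 1s, so the 32-bit
    vendor/device dword at offset 0 reads back as 0xFFFFFFFF.
    """
    return len(cfg) >= 4 and cfg[:4] == b"\xff\xff\xff\xff"

def read_config(bdf: str = "0000:01:00.0", nbytes: int = 64) -> bytes:
    """Read the first nbytes of config space via sysfs (needs root)."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read(nbytes)
```

Running config_looks_dead(read_config()) periodically lets me catch the
moment the device stops responding without waiting for lspci.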
Below are relevant dmesg log entries:
[80269.706159] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x00
len=4 data=0xffffffff
[80269.706491] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x04
len=4 data=0xffffffff
[80269.706524] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x08
len=4 data=0xffffffff
[80269.706556] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x0c
len=4 data=0xffffffff
[80269.706587] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x10
len=4 data=0xffffffff
[80269.706617] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x14
len=4 data=0xffffffff
[80269.706648] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x18
len=4 data=0xffffffff
[80269.706678] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x1c
len=4 data=0xffffffff
[80269.706709] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x20
len=4 data=0xffffffff
[80269.706739] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x24
len=4 data=0xffffffff
[80269.706769] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x28
len=4 data=0xffffffff
[80269.706800] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x2c
len=4 data=0xffffffff
[80269.707344] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x30
len=4 data=0xffffffff
[80269.707380] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x34
len=4 data=0xffffffff
[80269.707411] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x38
len=4 data=0xffffffff
[80269.707447] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x3c
len=4 data=0xffffffff
[80269.828180] ACPI: \_SB_.PCI0.RP01: ACPI: ACPI_NOTIFY_BUS_CHECK event
[80269.828259] ACPI: \_SB_.PCI0.RP01: ACPI: OSL: Scheduling hotplug
event 0 for deferred handling
[80269.849376] ACPI: \_SB_.PCI0.RP01: Bridge acquired in hotplug_event()
[80269.850497] ACPI: \_SB_.PCI0.RP01: calling early_dump_pci_device()
in hotplug_event() - entry
[80269.850576] pcieport 0000:00:1c.0: config space:
[80269.851128] 00000000: 86 80 48 0f 07 04 10 00 11 00 04 06 10 00 81 00
[80269.851194] 00000010: 00 00 00 00 00 00 00 00 00 01 01 00 10 10 00 20
[80269.851253] 00000020: 40 b0 70 b0 f1 ff 01 00 00 00 00 00 00 00 00 00
[80269.851313] 00000030: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 12 00
[80269.851371] 00000040: 10 80 42 01 00 80 00 00 00 00 11 00 22 4c 31 01
[80269.851430] 00000050: 00 0c 21 30 60 00 04 00 00 00 40 01 00 00 00 00
[80269.851489] 00000060: 00 00 00 00 16 00 00 00 00 00 00 00 00 00 00 00
[80269.851546] 00000070: 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
[80269.851605] 00000080: 05 90 01 00 00 20 e0 fe 20 00 00 00 00 00 00 00
[80269.851663] 00000090: 0d a0 00 00 86 80 48 0f 00 00 00 00 00 00 00 00
[80269.851721] 000000a0: 01 00 03 c8 00 00 00 00 00 00 00 00 00 00 00 00
[80269.851780] 000000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[80269.851837] 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[80269.851895] 000000d0: 00 c0 00 00 42 08 00 00 00 80 11 c9 00 00 00 00
[80269.851953] 000000e0: 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00
[80269.852011] 000000f0: 50 00 00 00 c0 00 00 00 1a 0f 13 01 00 40 00 01
[80269.852143] pci 0000:01:00.0: Dumping configuration space for 01:00.0:
[80269.852207] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00
len=4 data=0x55551556
[80269.852275] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x04
len=4 data=0x100000
[80269.852342] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x08
len=4 data=0x5000000
[80269.852407] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x0c
len=4 data=0x0
[80269.852471] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x10
len=4 data=0x0
[80269.852537] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x14
len=4 data=0x0
[80269.852602] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x18
len=4 data=0x0
[80269.852666] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x1c
len=4 data=0x0
[80269.852731] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x20
len=4 data=0x0
[80269.852796] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x24
len=4 data=0x0
[80269.852860] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x28
len=4 data=0x0
[80269.852926] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x2c
len=4 data=0x4000
[80269.852991] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x30
len=4 data=0x0
[80269.853056] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x34
len=4 data=0x40
[80269.853122] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x38
len=4 data=0x0
[80269.853186] KERNEL PCI READ: res=0, bus=01 dev=00 func=0 pos=0x3c
len=4 data=0x1ff
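Decoding the dump above (a sketch, using the dword values from the
KERNEL PCI READ lines): the dword at offset 0x00, 0x55551556, is
little-endian vendor 0x1556 (PLDA) / device 0x5555, and offset 0x10
confirms BAR0 is now 0:

```python
def decode_header(regs):
    """Decode key PCI config header fields from {offset: dword} values."""
    ids = regs[0x00]
    return {
        "vendor": ids & 0xFFFF,          # low 16 bits of dword at 0x00
        "device": (ids >> 16) & 0xFFFF,  # high 16 bits
        "bar0": regs[0x10] & ~0xF,       # mask off BAR type/prefetch bits
    }

# Values taken from the KERNEL PCI READ dump above
regs = {0x00: 0x55551556, 0x10: 0x0}
info = decode_header(regs)
# vendor 0x1556 (PLDA), device 0x5555, BAR0 cleared to 0
```

So unlike the all-0xFF snapshot earlier, at this point the device is
responding again, but its BAR0 assignment has been lost.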
Questions:
1. Could these ACPI errors impact PCI device access? Given that the
errors originate from \_SB.PCI0.SBUS.RDWD, could this failure disrupt
PCI device communication or cause access failures?
2. What typically causes AE_AML_LOOP_TIMEOUT errors? Are there known
methods to avoid these errors or mitigate their impact on PCIe
devices?
3. Could this be a firmware/BIOS issue? If so, what tools or steps
should I use to verify and debug the ACPI tables?
I appreciate any insights or guidance on how to debug this issue further.
>
> What is this device? What driver is bound to it? I don't see
> anything in dmesg that identifies a driver.
>
> > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > I am not sure why lspci initially showed all FF's and then the next
> > run showed BAR0 reset.
> > Complete sudo lspci -xxx -s 01:00.0 output is captured in the attached
> > dmesg_log_pci_bar_reset.txt file.
> >
> > /sys/firmware/acpi/interrupts/gpe01: 1 EN enabled unmasked
> > /sys/firmware/acpi/interrupts/gpe02: 1 EN enabled unmasked
> >
> >
> > 5. Debugging Steps:
> >
> > Instrumenting acpiphp_check_bridge() will indicate whether we are
> > enabling or disabling a slot (enable_slot() or disable_slot()). Based
> > on the dmesg log, there is only one ACPI_NOTIFY_BUS_CHECK event, and
> > it is most likely for disable_slot(). However, will instrumenting
> > acpiphp_check_bridge() explain why this is happening without the
> > PCI PLDA device actually being removed?
>
> No, it won't explain that. But if there was no add/remove event,
> re-enumeration should be harmless. The objective of instrumentation
> would be to figure out why it isn't harmless in this case.
>
> > 6. Reproduction and Additional Information:
> >
> > We do not see any clear pattern or procedure to reproduce this issue.
> > Once the issue occurs, rebooting the machine resolves it, but it
> > reoccurs after an unpredictable time.
> > We have another identical hardware setup with an older kernel (Ubuntu
> > 16.04.4 LTS, kernel version: 4.4.0-66-generic), and this issue has not
> > been observed so far on that machine.
> > Any additional pointers or suggestions on how to proceed to the root
> > cause of this issue would be greatly appreciated.
>
> You're seeing the problem on v5.4 (Nov 2019), which is much newer than
> v4.4 (Jan 2016). But v5.4 is still really too old to spend a lot of
> time on unless the problem still happens on a current kernel.
>
> Bjorn
View attachment "dmesg_march19th_log.txt" of type "text/plain" (263284 bytes)