[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMciSVV317JUzxoL1yO_rC_=p8hJUsfBK5UpxDFYNjjaw0ePcQ@mail.gmail.com>
Date: Mon, 10 Mar 2025 16:53:10 +0530
From: Naveen Kumar P <naveenkumar.parna@...il.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
kernelnewbies <kernelnewbies@...nelnewbies.org>, linux-acpi@...r.kernel.org
Subject: Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
On Wed, Mar 5, 2025 at 4:14 AM Naveen Kumar P
<naveenkumar.parna@...il.com> wrote:
>
> On Wed, Mar 5, 2025 at 2:31 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> >
> > On Tue, Mar 04, 2025 at 10:19:07PM +0530, Naveen Kumar P wrote:
> > > On Tue, Mar 4, 2025 at 1:35 PM Naveen Kumar P
> > > <naveenkumar.parna@...il.com> wrote:
> > > ...
> >
> > > For this test run, I removed all three parameters (pcie_aspm=off,
> > > pci=nomsi, and pcie_ports=on) and booted with the following kernel
> > > command line arguments:
> > >
> > > cat /proc/cmdline
> > > BOOT_IMAGE=/vmlinuz-6.13.0+ root=/dev/mapper/vg00-rootvol ro quiet
> > > "dyndbg=file drivers/pci/* +p; file drivers/acpi/bus.c +p; file
> > > drivers/acpi/osl.c +p"
> > >
> > > This time, the issue occurred earlier, at 22998 seconds. Below is the
> > > relevant dmesg log during the ACPI_NOTIFY_BUS_CHECK event. The
> > > complete log is attached (dmesg_march4th_log.txt).
> > >
> > > [22998.536705] ACPI: \_SB_.PCI0.RP01: ACPI: ACPI_NOTIFY_BUS_CHECK event
> > > [22998.536753] ACPI: \_SB_.PCI0.RP01: ACPI: OSL: Scheduling hotplug
> > > event 0 for deferred handling
> > > [22998.536934] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bridge acquired in
> > > hotplug_event()
> > > [22998.536972] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > [22998.537002] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Checking bridge in
> > > hotplug_event()
> > > [22998.537024] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4
> > > data=0x55551556
> > > [22998.537066] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4
> > > data=0x55551556
> >
> > Fine again.
> >
> > > [22998.537094] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Enabling slot in
> > > acpiphp_check_bridge()
> > > [22998.537155] ACPI: Device [PXSX] status [0000000f]
> > > [22998.537206] ACPI: Device [D015] status [0000000f]
> > > [22998.537276] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Releasing bridge
> > > in hotplug_event()
> > >
> > > sudo lspci -xxx -s 01:00.0 | grep 10:
> > > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > Obviously a problem. Can you start including the whole
> > "lspci -x -s 01:00.0" output? Obviously the Vendor ID reads above
> > worked fine. I *assume* it's still fine here, and only the BARs are
> > zeroed out?
> I've captured the complete lspci output from the last run, and it is as follows:
>
> $sudo lspci -xxx -s 01:00.0
> 01:00.0 RAM memory: PLDA Device 5555
> 00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
> 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
> 40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
> 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
> 70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
> 80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
> 90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> I've also observed some inconsistencies in the behavior. In previous
> runs, the first invocation of lspci showed all FF's, and then the next
> run resulted in a PCI BAR reset, as mentioned below.
>
> Previous runs - first invocation of lspci output :
> --------------------------------------------------
> $sudo lspci -xxx -s 01:00.0
> 01:00.0 RAM memory: PLDA Device 5555 (rev ff)
> 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 40: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 60: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 90: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>
> Previous runs - second invocation of lspci output :
> --------------------------------------------------
> $sudo lspci -xxx -s 01:00.0
> 01:00.0 RAM memory: PLDA Device 5555
> 00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
> 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
> 40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
> 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
> 70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
> 80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
> 90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> However this time, the first run didn't show all FF's but instead
> directly resulted in a PCI BAR reset.
>
>
> >
> > I assume you saw no new dmesg logs about config accesses to the device
> > before the lspci. If you instrumented the user config accessors
> > (pci_user_read_config_*(), also in access.c), you should see those
> > accesses.
> i will try this and update you with the results soon.
> >
> > You could sprinkle some calls to early_dump_pci_device() through the
> > acpiphp path. Turn off the kernel config access tracing when you do
> > this so it doesn't clutter things up.
I instrumented the user config accessors (pci_user_read_config_*())
and added calls to early_dump_pci_device() in the ACPIPHP path. The
corresponding patch
(0003-added-early_dump_pci_device-through-the-acpiphp-path.patch) is
attached for reference.
After rebuilding the kernel with these changes, I rebooted the system
and monitored the ACPI_NOTIFY_BUS_CHECK event in dmesg, but it did not
appear initially. After waiting for three days, I manually ran lspci,
which resulted in all 0xFFs in the PCI config space from 0x00 to 0x3F
as shown below:
$ sudo lspci -xxx -s 01:00.0
01:00.0 RAM memory: PLDA Device 5555 (rev ff)
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
The corresponding dmesg log confirms this:
[260228.608982] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x00
len=4 data=0xffffffff
...
[260228.609741] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x3c
len=4 data=0xffffffff
[260228.640346] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x40
len=4 data=0x34801
..
[260228.640666] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0xfc
len=4 data=0x0
However, after completing this command, the ACPI_NOTIFY_BUS_CHECK
event was triggered. At this point, the early_dump_pci_device() output
reflected the correct values in the PCI config space (0x00 to 0x3F),
except for BAR0 (offset 0x10), which had been reset to 0x0:
[260228.711860] ACPI: \_SB_.PCI0.RP01: ACPI: ACPI_NOTIFY_BUS_CHECK event
[260228.711902] ACPI: \_SB_.PCI0.RP01: ACPI: OSL: Scheduling hotplug
event 0 for deferred handling
[260228.712004] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bridge acquired
in hotplug_event()
[260228.712032] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in
hotplug_event()
[260228.712053] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Checking bridge
in hotplug_event()
[260228.712096] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: calling
early_dump_pci_device() before enable_slot() in acpiphp_check_bridge()
[260228.712121] pcieport 0000:00:1c.0: config space:
[260228.712217] 00000000: 86 80 48 0f 07 04 10 00 11 00 04 06 10 00 81 00
[260228.712232] 00000010: 00 00 00 00 00 00 00 00 00 01 01 00 10 10 00 20
[260228.712246] 00000020: 40 b0 70 b0 f1 ff 01 00 00 00 00 00 00 00 00 00
[260228.712258] 00000030: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 12 00
[260228.712270] 00000040: 10 80 42 01 00 80 00 00 00 00 11 00 22 4c 31 01
[260228.712282] 00000050: 00 0c 21 30 60 00 04 00 00 00 40 01 00 00 00 00
[260228.712294] 00000060: 00 00 00 00 16 00 00 00 00 00 00 00 00 00 00 00
[260228.712306] 00000070: 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
[260228.712318] 00000080: 05 90 01 00 00 20 e0 fe 20 00 00 00 00 00 00 00
[260228.712330] 00000090: 0d a0 00 00 86 80 48 0f 00 00 00 00 00 00 00 00
[260228.712342] 000000a0: 01 00 03 c8 00 00 00 00 00 00 00 00 00 00 00 00
[260228.712354] 000000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[260228.712366] 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[260228.712378] 000000d0: 00 c0 00 00 42 08 00 00 00 80 11 c9 00 00 00 00
[260228.712390] 000000e0: 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00
[260228.712402] 000000f0: 50 00 00 00 c0 00 00 00 1a 0f 13 01 00 40 00 01
[260228.712420] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Enabling slot in
acpiphp_check_bridge()
A second lspci command confirmed that BAR0 remained 0x00, aligning
with the early_dump_pci_device() output:
$sudo lspci -xxx -s 01:00.0
01:00.0 RAM memory: PLDA Device 5555
00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
While I can now reliably reproduce the issue, I still don’t have
clarity on why BAR0 is being reset to zero?
I would like to summarize my observations as follows. Based on my
testing, I have noticed that the PCI BAR0 address (offset 0x10) gets
reset to zero under the following scenarios:
1. Spontaneous ACPI BUS CHECK Event
After booting the system, at an unpredictable time, an
ACPI_NOTIFY_BUS_CHECK event appears in the dmesg log.
Running lspci -xxx -s 01:00.0 after this event shows that BAR0 (offset
0x10) is reset to zero.
2. ACPI BUS CHECK Event Triggered After lspci Shows All 0xFF’s (Full
Config Space)
I waited for the ACPI_NOTIFY_BUS_CHECK event, but it did not appear in
the dmesg log.
Running lspci -xxx -s 01:00.0 at this stage resulted in all 0xFF's in
the PCI configuration space from offset 0x00 to 0xFF.
Immediately after this, the ACPI_NOTIFY_BUS_CHECK event appeared in dmesg.
After this event, running lspci again showed correct values in the PCI
configuration space, except BAR0 (offset 0x10) was still reset to
zero.
3. ACPI BUS CHECK Event Triggered After lspci Shows Partial 0xFF’s
(First 64 Bytes - as observed in this dmesg log)
This scenario is similar to the second one.
Running lspci -xxx -s 01:00.0 resulted in all 0xFF's from offset 0x00
to 0x3F instead of the entire config space.
The ACPI_NOTIFY_BUS_CHECK event then appeared, and lspci showed
correct values except for BAR0 (offset 0x10), which was reset to zero.
Why does reading the PCI configuration space with lspci sometimes
return 0xFF's for the first 64 bytes (offset 0x00 to 0x3F) and other
times for the entire 256-byte space (offset 0x00 to 0xFF)?
Does the kernel provide any debug messages or hooks that can help
determine why the PCI config space reads all 0xFF's before the ACPI
BUS CHECK event?
Should I add any additional instrumentation in the kernel (e.g.,
deeper tracing in access.c, ACPI hotplug path, or PCIe port handling)
to gather more information on why BAR0 is being reset?
I have attached the patch
0003-added-early_dump_pci_device-through-the-acpiphp-path.patch and
the full dmesg log for reference.
Looking forward to your thoughts.
> >
> > What is this device? Is it a shipping product? Do you have good
> The PCIe device in question is a Xilinx FPGA endpoint, which is
> flashed with RTL code to expose several host interfaces to the system
> via the PCIe link.
>
> > confidence that the hardware is working correctly? I guess you said
> > it works correctly on a different machine with an older kernel. I
> > would swap the cards between machines in case one card is broken.
> >
> > You could try bisecting between the working kernel and the broken one.
> > It's kind of painful since it takes so long to reproduce the problem.
> >
> > Bjorn
View attachment "dmesg_march9th_log.txt" of type "text/plain" (107931 bytes)
Download attachment "0003-added-early_dump_pci_device-through-the-acpiphp-path.patch" of type "application/octet-stream" (6262 bytes)
Powered by blists - more mailing lists