Message-ID: <CAMciSVXAJwN9q_HVYfrCNH7k_+HXvp4NWn6w=dCrrw2C-_BHRg@mail.gmail.com>
Date: Tue, 18 Mar 2025 19:37:53 +0530
From: Naveen Kumar P <naveenkumar.parna@...il.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
kernelnewbies <kernelnewbies@...nelnewbies.org>, linux-acpi@...r.kernel.org
Subject: Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
Following up on the PCI BAR0 reset issue, I have collected more data
and performed additional debugging steps. Below are the key
observations and queries:
Observations:
1. The issue occurs randomly after a few days of system uptime.
2. When the issue occurs, reading the PCI configuration space (lspci
-xxx -s 01:00.0) returns all 0xff bytes, sometimes only for offsets
0x00 to 0x3F and sometimes for the entire config space.
$ sudo lspci -xxx -s 01:00.0
01:00.0 RAM memory: PLDA Device 5555 (rev ff)
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
40: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
60: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
90: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
$ uptime
11:29:59 up 3 days, 14:39, 1 user, load average: 0.75, 0.57, 0.47
3. After this, subsequent reads return valid (non-0xff) values again,
but BAR0 reads back as zero.
$ sudo lspci -xxx -s 01:00.0
01:00.0 RAM memory: PLDA Device 5555
00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 02 00 c2 8f 00 00 10 28 01 00 21 f4 03 00
70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
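A minimal sketch of how dumps like the two above could be classified
offline. It is written in Python and is not part of the kernel
instrumentation; the helper names parse_lspci_hex() and summarize() are
hypothetical. It assumes the text holds the "lspci -x"/"-xxx" output of a
single device, and it reads the standard header offsets (Vendor ID 0x00,
Device ID 0x02, Command 0x04, BAR0 0x10).

```python
import re

# One hex row of lspci output, e.g. "10: 00 00 00 00 ..."
ROW = re.compile(r"^\s*([0-9a-f]{2,3}):((?: [0-9a-f]{2}){1,16})\s*$")


def parse_lspci_hex(text):
    """Collect the hex rows of a single-device lspci dump into bytes."""
    cfg = bytearray()
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # device description line or unrelated text
        if int(m.group(1), 16) != len(cfg):
            raise ValueError("non-contiguous dump row: " + line)
        cfg += bytes(int(b, 16) for b in m.group(2).split())
    return bytes(cfg)


def summarize(cfg):
    """Report an all-ones dump, or decode the header fields of interest."""
    if len(cfg) < 0x40:
        raise ValueError("need at least the 64-byte header")
    if all(b == 0xFF for b in cfg):
        return {"state": "all-ones (entire dump)"}
    if all(b == 0xFF for b in cfg[:0x40]):
        return {"state": "all-ones (header 0x00-0x3f only)"}

    def le(off, size):
        # Config space fields are little-endian.
        return int.from_bytes(cfg[off:off + size], "little")

    return {
        "state": "readable",
        "vendor_id": le(0x00, 2),
        "device_id": le(0x02, 2),
        "command": le(0x04, 2),
        "bar0": le(0x10, 4),
    }
```

Applied to the second dump above, this reports BAR0 as zero and also a
Command register of 0x0000 (bytes 04-05 read 00 00).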
Debugging Steps Taken:
Used ftrace to track relevant PCI, hotplug, and BIOS-related function calls:
echo function > current_tracer
echo '*pci* *hotplug* *bios*' > set_ftrace_filter
The captured trace file is 19MB, so sharing the full trace isn't
feasible. A few lines of the dmesg log where config space reads
returned "all ones" are shown below:
[311972.545818] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x00 len=4 data=0xffffffff
[311972.564790] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x04 len=4 data=0xffffffff
[311972.564826] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x08 len=4 data=0xffffffff
[311972.564840] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x0c len=4 data=0xffffffff
[311972.564848] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x10 len=4 data=0xffffffff
[311972.564856] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x14 len=4 data=0xffffffff
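A minimal sketch for pulling the "all ones" reads out of a saved dmesg
log. It is written in Python, and the helper name all_ones_reads() is
hypothetical. It assumes the "USER PCI READ" line format printed by the
local instrumentation patch, exactly as shown above, and it tolerates
lines wrapped by the mailer. Quoted text with "> " prefixes is not
handled.

```python
import re

# Matches the locally added "USER PCI READ" message; \s+ between fields
# lets a record span a wrapped line.
READ = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+USER PCI READ:\s+ret=(-?\d+),\s+"
    r"bus=([0-9a-f]+)\s+dev=([0-9a-f]+)\s+func=([0-9a-f]+)\s+"
    r"pos=0x([0-9a-f]+)\s+len=(\d+)\s+data=0x([0-9a-f]+)"
)


def all_ones_reads(log):
    """Return (timestamp, offset) for every read whose data is all ones."""
    hits = []
    for m in READ.finditer(log):
        stamp = m.group(1)            # kept as text to avoid rounding
        pos = int(m.group(6), 16)     # config space offset
        length = int(m.group(7))      # access size in bytes
        data = int(m.group(8), 16)
        if data == (1 << (8 * length)) - 1:
            hits.append((stamp, pos))
    return hits
```

The first timestamp it returns marks the point in the function trace
where the device stopped answering.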
The function tracer log for the same time window is shown below:
node_exporter-1484 [000] ..... 311972.543102: acpi_ex_pci_config_space_handler <-acpi_ev_address_space_dispatch
node_exporter-1484 [000] ..... 311972.543104: acpi_os_read_pci_configuration <-acpi_ex_pci_config_space_handler
node_exporter-1484 [000] ..... 311972.543104: raw_pci_read <-acpi_os_read_pci_configuration
node_exporter-1484 [000] ..... 311972.543105: pci_conf1_read <-raw_pci_read
lspci-242643 [001] ..... 311972.557702: pci_resource_to_user <-resource_show
lspci-242643 [001] ..... 311972.557707: pci_resource_to_user <-resource_show
lspci-242643 [001] ..... 311972.557708: pci_resource_to_user <-resource_show
lspci-242643 [001] ..... 311972.557710: pci_resource_to_user <-resource_show
lspci-242643 [001] ..... 311972.557711: pci_resource_to_user <-resource_show
lspci-242643 [001] ..... 311972.557712: pci_resource_to_user <-resource_show
lspci-242643 [001] ..... 311972.557712: pci_resource_to_user <-resource_show
lspci-242643 [001] ..... 311972.557713: pci_resource_to_user <-resource_show
lspci-242643 [001] ..... 311972.557714: pci_resource_to_user <-resource_show
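Since the full trace is too large to read by hand, a minimal sketch for
listing the traced calls close to a given dmesg timestamp. It is written
in Python, and the helper name calls_near() is hypothetical. It assumes
the function tracer line format shown above, and that the ftrace and
dmesg timestamps come from the same clock, which should be checked
against the trace_clock setting.

```python
import re

# task-pid [cpu] flags timestamp: function <-caller
# \s+ between fields lets a record span a wrapped line.
CALL = re.compile(
    r"(\S+-\d+)\s+\[(\d+)\]\s+\S+\s+(\d+\.\d+):\s+([\w.]+)\s+<-([\w.]+)"
)


def calls_near(trace, when, window=0.05):
    """Return (task, function, caller) for calls within +/- window seconds."""
    out = []
    for m in CALL.finditer(trace):
        t = float(m.group(3))
        if abs(t - when) <= window:
            out.append((m.group(1), m.group(4), m.group(5)))
    return out
```

For example, calls_near(trace_text, 311972.545818, window=0.005) selects
the node_exporter ACPI config space accesses in the excerpt above and
leaves out the later lspci entries.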
Questions & Next Steps:
1. Device Inaccessibility:
When config space reads return all 0xff, does this indicate that the
device has been reset or powered down?
If so, shouldn't the kernel receive a hotplug event or an error
notification?
2. ftrace Analysis:
Are there any additional function filters I should add to
set_ftrace_filter to capture relevant events?
Are there any specific function calls in the trace file that I should
focus on for debugging?
Are there any PCIe registers you would recommend checking?
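One candidate set of registers, offered as an assumption rather than a
confirmed answer, is Device Status and Link Status in the device's PCI
Express capability. The sketch below is written in Python, and the
helper names find_capability() and pcie_status() are hypothetical. It
walks the capability list of a config space image (for example the bytes
of one of the lspci dumps above) using the standard offsets:
Capabilities Pointer at 0x34, capability ID 0x10 for PCI Express, Device
Status at +0x0A and Link Status at +0x12.

```python
PCI_STATUS = 0x06
PCI_STATUS_CAP_LIST = 0x10
PCI_CAPABILITY_LIST = 0x34
PCI_CAP_ID_EXP = 0x10


def find_capability(cfg, cap_id):
    """Return the offset of capability cap_id, or None if not found."""
    status = int.from_bytes(cfg[PCI_STATUS:PCI_STATUS + 2], "little")
    if not status & PCI_STATUS_CAP_LIST:
        return None
    ptr = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against loops, e.g. in an all-ones image
    while ptr >= 0x40 and ptr + 1 < len(cfg) and ptr not in seen:
        seen.add(ptr)
        if cfg[ptr] == cap_id:
            return ptr
        ptr = cfg[ptr + 1] & 0xFC
    return None


def pcie_status(cfg):
    """Decode Device Status and Link Status from the PCIe capability."""
    cap = find_capability(cfg, PCI_CAP_ID_EXP)
    if cap is None or cap + 0x14 > len(cfg):
        return None
    devsta = int.from_bytes(cfg[cap + 0x0A:cap + 0x0C], "little")
    lnksta = int.from_bytes(cfg[cap + 0x12:cap + 0x14], "little")
    return {
        "cap_offset": cap,
        "correctable_error": bool(devsta & 0x1),
        "nonfatal_error": bool(devsta & 0x2),
        "fatal_error": bool(devsta & 0x4),
        "unsupported_request": bool(devsta & 0x8),
        "link_speed_code": lnksta & 0xF,
        "link_width": (lnksta >> 4) & 0x3F,
    }
```

In the most recent readable dump above, the capability chain is 0x40,
0x48, 0x60, and the bytes at 0x6a read 01 00, which corresponds to
Device Status bit 0 (Correctable Error Detected). The older dumps quoted
below show 00 00 at the same offset.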
Let me know your thoughts and any additional suggestions.
On Tue, Mar 11, 2025 at 11:34 PM Naveen Kumar P
<naveenkumar.parna@...il.com> wrote:
>
> On Mon, Mar 10, 2025 at 4:53 PM Naveen Kumar P
> <naveenkumar.parna@...il.com> wrote:
> >
> > On Wed, Mar 5, 2025 at 4:14 AM Naveen Kumar P
> > <naveenkumar.parna@...il.com> wrote:
> > >
> > > On Wed, Mar 5, 2025 at 2:31 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > >
> > > > On Tue, Mar 04, 2025 at 10:19:07PM +0530, Naveen Kumar P wrote:
> > > > > On Tue, Mar 4, 2025 at 1:35 PM Naveen Kumar P
> > > > > <naveenkumar.parna@...il.com> wrote:
> > > > > ...
> > > >
> > > > > For this test run, I removed all three parameters (pcie_aspm=off,
> > > > > pci=nomsi, and pcie_ports=on) and booted with the following kernel
> > > > > command line arguments:
> > > > >
> > > > > cat /proc/cmdline
> > > > > BOOT_IMAGE=/vmlinuz-6.13.0+ root=/dev/mapper/vg00-rootvol ro quiet
> > > > > "dyndbg=file drivers/pci/* +p; file drivers/acpi/bus.c +p; file
> > > > > drivers/acpi/osl.c +p"
> > > > >
> > > > > This time, the issue occurred earlier, at 22998 seconds. Below is the
> > > > > relevant dmesg log during the ACPI_NOTIFY_BUS_CHECK event. The
> > > > > complete log is attached (dmesg_march4th_log.txt).
> > > > >
> > > > > [22998.536705] ACPI: \_SB_.PCI0.RP01: ACPI: ACPI_NOTIFY_BUS_CHECK event
> > > > > [22998.536753] ACPI: \_SB_.PCI0.RP01: ACPI: OSL: Scheduling hotplug
> > > > > event 0 for deferred handling
> > > > > [22998.536934] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bridge acquired in
> > > > > hotplug_event()
> > > > > [22998.536972] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > > [22998.537002] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Checking bridge in
> > > > > hotplug_event()
> > > > > [22998.537024] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4
> > > > > data=0x55551556
> > > > > [22998.537066] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4
> > > > > data=0x55551556
> > > >
> > > > Fine again.
> > > >
> > > > > [22998.537094] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Enabling slot in
> > > > > acpiphp_check_bridge()
> > > > > [22998.537155] ACPI: Device [PXSX] status [0000000f]
> > > > > [22998.537206] ACPI: Device [D015] status [0000000f]
> > > > > [22998.537276] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Releasing bridge
> > > > > in hotplug_event()
> > > > >
> > > > > sudo lspci -xxx -s 01:00.0 | grep 10:
> > > > > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > >
> > > > Obviously a problem. Can you start including the whole
> > > > "lspci -x -s 01:00.0" output? Obviously the Vendor ID reads above
> > > > worked fine. I *assume* it's still fine here, and only the BARs are
> > > > zeroed out?
> > > I've captured the complete lspci output from the last run, and it is as follows:
> > >
> > > $sudo lspci -xxx -s 01:00.0
> > > 01:00.0 RAM memory: PLDA Device 5555
> > > 00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
> > > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
> > > 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
> > > 40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
> > > 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
> > > 70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
> > > 90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >
> > > I've also observed some inconsistencies in the behavior. In previous
> > > runs, the first invocation of lspci showed all FF's, and then the next
> > > run resulted in a PCI BAR reset, as mentioned below.
> > >
> > > Previous runs - first invocation of lspci output :
> > > --------------------------------------------------
> > > $sudo lspci -xxx -s 01:00.0
> > > 01:00.0 RAM memory: PLDA Device 5555 (rev ff)
> > > 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > 40: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > 50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > 60: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > 70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > 80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > 90: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > > f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > >
> > > Previous runs - second invocation of lspci output :
> > > --------------------------------------------------
> > > $sudo lspci -xxx -s 01:00.0
> > > 01:00.0 RAM memory: PLDA Device 5555
> > > 00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
> > > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
> > > 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
> > > 40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
> > > 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
> > > 70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
> > > 90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >
> > > However this time, the first run didn't show all FF's but instead
> > > directly resulted in a PCI BAR reset.
> > >
> > >
> > > >
> > > > I assume you saw no new dmesg logs about config accesses to the device
> > > > before the lspci. If you instrumented the user config accessors
> > > > (pci_user_read_config_*(), also in access.c), you should see those
> > > > accesses.
> > > i will try this and update you with the results soon.
> > > >
> > > > You could sprinkle some calls to early_dump_pci_device() through the
> > > > acpiphp path. Turn off the kernel config access tracing when you do
> > > > this so it doesn't clutter things up.
> > I instrumented the user config accessors (pci_user_read_config_*())
> > and added calls to early_dump_pci_device() in the ACPIPHP path. The
> > corresponding patch
> > (0003-added-early_dump_pci_device-through-the-acpiphp-path.patch) is
> > attached for reference.
> >
> > After rebuilding the kernel with these changes, I rebooted the system
> > and monitored the ACPI_NOTIFY_BUS_CHECK event in dmesg, but it did not
> > appear initially. After waiting for three days, I manually ran lspci,
> > which resulted in all 0xFFs in the PCI config space from 0x00 to 0x3F
> > as shown below:
> > $ sudo lspci -xxx -s 01:00.0
> > 01:00.0 RAM memory: PLDA Device 5555 (rev ff)
> > 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > 40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
> > 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
> > 70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
> > 90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
> > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> >
> > The corresponding dmesg log confirms this:
> >
> > [260228.608982] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x00
> > len=4 data=0xffffffff
> > ...
> > [260228.609741] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x3c
> > len=4 data=0xffffffff
> > [260228.640346] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0x40
> > len=4 data=0x34801
> > ..
> > [260228.640666] USER PCI READ: ret=0, bus=01 dev=00 func=0 pos=0xfc
> > len=4 data=0x0
> >
> >
> > However, after completing this command, the ACPI_NOTIFY_BUS_CHECK
> > event was triggered. At this point, the early_dump_pci_device() output
> > reflected the correct values in the PCI config space (0x00 to 0x3F),
> > except for BAR0 (offset 0x10), which had been reset to 0x0:
> >
> > [260228.711860] ACPI: \_SB_.PCI0.RP01: ACPI: ACPI_NOTIFY_BUS_CHECK event
> > [260228.711902] ACPI: \_SB_.PCI0.RP01: ACPI: OSL: Scheduling hotplug
> > event 0 for deferred handling
> > [260228.712004] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bridge acquired
> > in hotplug_event()
> > [260228.712032] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in
> > hotplug_event()
> > [260228.712053] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Checking bridge
> > in hotplug_event()
> > [260228.712096] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: calling
> > early_dump_pci_device() before enable_slot() in acpiphp_check_bridge()
> > [260228.712121] pcieport 0000:00:1c.0: config space:
> > [260228.712217] 00000000: 86 80 48 0f 07 04 10 00 11 00 04 06 10 00 81 00
> > [260228.712232] 00000010: 00 00 00 00 00 00 00 00 00 01 01 00 10 10 00 20
> > [260228.712246] 00000020: 40 b0 70 b0 f1 ff 01 00 00 00 00 00 00 00 00 00
> > [260228.712258] 00000030: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 12 00
> > [260228.712270] 00000040: 10 80 42 01 00 80 00 00 00 00 11 00 22 4c 31 01
> > [260228.712282] 00000050: 00 0c 21 30 60 00 04 00 00 00 40 01 00 00 00 00
> > [260228.712294] 00000060: 00 00 00 00 16 00 00 00 00 00 00 00 00 00 00 00
> > [260228.712306] 00000070: 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
> > [260228.712318] 00000080: 05 90 01 00 00 20 e0 fe 20 00 00 00 00 00 00 00
> > [260228.712330] 00000090: 0d a0 00 00 86 80 48 0f 00 00 00 00 00 00 00 00
> > [260228.712342] 000000a0: 01 00 03 c8 00 00 00 00 00 00 00 00 00 00 00 00
> > [260228.712354] 000000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > [260228.712366] 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > [260228.712378] 000000d0: 00 c0 00 00 42 08 00 00 00 80 11 c9 00 00 00 00
> > [260228.712390] 000000e0: 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00
> > [260228.712402] 000000f0: 50 00 00 00 c0 00 00 00 1a 0f 13 01 00 40 00 01
> > [260228.712420] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Enabling slot in
> > acpiphp_check_bridge()
> >
> >
> > A second lspci command confirmed that BAR0 remained 0x00, aligning
> > with the early_dump_pci_device() output:
> > $sudo lspci -xxx -s 01:00.0
> > 01:00.0 RAM memory: PLDA Device 5555
> > 00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
> > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
> > 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
> > 40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
> > 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
> > 70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
> > 90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
> > a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > While I can now reliably reproduce the issue, I still don’t have
> > clarity on why BAR0 is being reset to zero?
> >
> > I would like to summarize my observations as follows. Based on my
> > testing, I have noticed that the PCI BAR0 address (offset 0x10) gets
> > reset to zero under the following scenarios:
> >
> > 1. Spontaneous ACPI BUS CHECK Event
> > After booting the system, at an unpredictable time, an
> > ACPI_NOTIFY_BUS_CHECK event appears in the dmesg log.
> > Running lspci -xxx -s 01:00.0 after this event shows that BAR0 (offset
> > 0x10) is reset to zero.
> >
> > 2. ACPI BUS CHECK Event Triggered After lspci Shows All 0xFF’s (Full
> > Config Space)
> > I waited for the ACPI_NOTIFY_BUS_CHECK event, but it did not appear in
> > the dmesg log.
> > Running lspci -xxx -s 01:00.0 at this stage resulted in all 0xFF's in
> > the PCI configuration space from offset 0x00 to 0xFF.
> > Immediately after this, the ACPI_NOTIFY_BUS_CHECK event appeared in dmesg.
> > After this event, running lspci again showed correct values in the PCI
> > configuration space, except BAR0 (offset 0x10) was still reset to
> > zero.
> >
> > 3. ACPI BUS CHECK Event Triggered After lspci Shows Partial 0xFF’s
> > (First 64 Bytes - as observed in this dmesg log)
> > This scenario is similar to the second one.
> > Running lspci -xxx -s 01:00.0 resulted in all 0xFF's from offset 0x00
> > to 0x3F instead of the entire config space.
> > The ACPI_NOTIFY_BUS_CHECK event then appeared, and lspci showed
> > correct values except for BAR0 (offset 0x10), which was reset to zero.
> >
> >
> > Why does reading the PCI configuration space with lspci sometimes
> > return 0xFF's for the first 64 bytes (offset 0x00 to 0x3F) and other
> > times for the entire 256-byte space (offset 0x00 to 0xFF)?
> >
> > Does the kernel provide any debug messages or hooks that can help
> > determine why the PCI config space reads all 0xFF's before the ACPI
> > BUS CHECK event?
> >
> > Should I add any additional instrumentation in the kernel (e.g.,
> > deeper tracing in access.c, ACPI hotplug path, or PCIe port handling)
> > to gather more information on why BAR0 is being reset?
> >
> > I have attached the patch
> > 0003-added-early_dump_pci_device-through-the-acpiphp-path.patch and
> > the full dmesg log for reference.
> >
> > Looking forward to your thoughts.
> >
> > > >
> > > > What is this device? Is it a shipping product? Do you have good
> > > The PCIe device in question is a Xilinx FPGA endpoint, which is
> > > flashed with RTL code to expose several host interfaces to the system
> > > via the PCIe link.
> > >
> > > > confidence that the hardware is working correctly? I guess you said
> > > > it works correctly on a different machine with an older kernel. I
> > > > would swap the cards between machines in case one card is broken.
> The issue occurs randomly after a few days of system uptime,
> post-boot, during runtime.
> The same PCIe card is currently working fine on another machine
> running Ubuntu 16.04 with kernel 4.4.0-66-generic. However, the
> affected machine is running Ubuntu 20.04 with a newly built latest
> kernel. Full testing on the older kernel is still on hold, as we are
> currently focusing on recent kernels.
> Apart from kernel differences, the BIOS versions are also different on
> these systems.
> Given that the issue happens at runtime and not during initialization,
> could BIOS influence this behavior after boot?
>
> BAR Register Differences & Configuration Source:
> ------------------------------------------------------------------
> The BASE_ADDRESS_0 register differs between the two machines:
> Ubuntu 20.04 (buggy system): 0xb0400000
> Ubuntu 16.04 (working system): 0xd0400000
>
> I would like to understand who determines this value:
> Is it set by BIOS, firmware, or the Linux kernel PCI driver?
> Could you provide a reference to the source code where the BAR
> registers are assigned and mapped?
>
> Earlier, CONFIG_PCI_DEBUG was not set. I have now enabled it and
> started the kernel build. Are there any other important CONFIG debug
> options that need to be enabled for debugging this?
>
> > > >
> > > > You could try bisecting between the working kernel and the broken one.
> > > > It's kind of painful since it takes so long to reproduce the problem.
> > > >
> > > > Bjorn