Message-ID: <CAMciSVU2Xeh+3KsFK33GGLK7h59n9A_1RANdFV+ghGv39qcxPw@mail.gmail.com>
Date: Tue, 4 Mar 2025 13:35:14 +0530
From: Naveen Kumar P <naveenkumar.parna@...il.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
kernelnewbies <kernelnewbies@...nelnewbies.org>, linux-acpi@...r.kernel.org
Subject: Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset
On Fri, Feb 28, 2025 at 9:31 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
>
> On Wed, Feb 26, 2025 at 06:28:33PM +0530, Naveen Kumar P wrote:
> > On Wed, Feb 26, 2025 at 2:08 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > On Tue, Feb 25, 2025 at 06:46:02PM +0530, Naveen Kumar P wrote:
> > > > On Tue, Feb 25, 2025 at 1:24 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > > > On Tue, Feb 25, 2025 at 12:29:00AM +0530, Naveen Kumar P wrote:
> > > > > > On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > > > > > On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote:
> > > > > > > > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > > > > > > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I am writing to seek assistance with an issue we are
> > > > > > > > > > experiencing with a PCIe device (PLDA Device 5555)
> > > > > > > > > > connected through PCI Express Root Port 1 to the
> > > > > > > > > > host bridge.
> > > > > > > > > >
> > > > > > > > > > We have observed that after booting the system, the
> > > > > > > > > > Base Address Register (BAR0) memory of this device
> > > > > > > > > > gets reset to 0x0 after approximately one hour or
> > > > > > > > > > more (the timing is inconsistent). This was verified
> > > > > > > > > > using the lspci output and the setpci -s 01:00.0
> > > > > > > > > > BASE_ADDRESS_0 command.
> > > > > > ...
>
> > I have downloaded the 6.13 kernel source and added additional debug
> > logs in hotplug_event(), then built the kernel. After that rebooted
> > with the new kernel using the following parameters:
> > BOOT_IMAGE=/vmlinuz-6.13.0+ root=/dev/mapper/vg00-rootvol ro quiet
> > libata.force=noncq pci=nomsi pcie_aspm=off pcie_ports=on "dyndbg=file
> > drivers/pci/* +p; file drivers/acpi/* +p"
>
> Why "pci=nomsi"? I don't think that should make a difference. Also,
> it contributes to the fact that Linux doesn't request OS control of
> several features that it ordinarily does, so you end up in a somewhat
> unusual state (which *should* still work, of course):
>
> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig Segments HPX-Type3]
> acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
>
> Same for "pcie_aspm=off".
I initially suspected that the BAR was being reset because the device
entered a low-power state, so I set pcie_aspm=off to rule that out.
I do not know why pci=nomsi and pcie_ports=on were set on the test
machine. I will remove both of them for the next test run.
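Before the next run, a small check along these lines could confirm which of the questioned parameters are still on the running kernel's command line. This is only a sketch: the function names and the default parameter tuple are my own illustrative choices, and it assumes the command line is available at /proc/cmdline.

```python
import shlex

# Parameters questioned in this thread (illustrative default, adjust as needed).
QUESTIONED = ("pci=nomsi", "pcie_aspm=off", "pcie_ports=on")


def questioned_params(cmdline, questioned=QUESTIONED):
    """Return the questioned parameters present in a kernel command line."""
    # shlex keeps a double-quoted argument such as "dyndbg=file ... +p"
    # together as a single token instead of splitting it on spaces.
    tokens = shlex.split(cmdline)
    return [p for p in questioned if p in tokens]


def running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (assumed path)."""
    with open(path) as f:
        return f.read().strip()
```

On the test machine this would be called as questioned_params(running_cmdline()); an empty list means none of the three parameters is set.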
>
> Why "pcie_ports=on"? That's not a valid parameter:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/portdrv.c?id=v6.13#n619
>
> > Complete dmesg log and the patch(to get additional debug information)
> > are attached to this email.
> >
> > Any further guidance on these observations?
>
> I'm out of ideas. I would instrument the PCI config accessors to log
> all the reads and writes to your device (01:00.0) to see what we do to
> the device. Maybe there's some hint:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/access.c?id=v6.13#n35
As you suggested, I instrumented the PCI config accessors to log all
reads and writes to my device (01:00.0). The patch
(0002-instrumented-the-PCI-config-accessors-to-log-all-the.patch) is
attached to this email. After applying it and rebooting with the same
boot parameters, the issue reproduced after 193890 seconds (roughly
53.9 hours). The complete dmesg log (dmesg_march3rd_log.txt) is also
attached.
Could you check whether the new log offers any clues? Do you recommend
any further instrumentation or debugging steps?
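As a complement to the in-kernel logging, a userspace poller could narrow down when BAR0 first reads back as zero. The following is a minimal sketch, not part of the attached patch. It assumes the device sits in PCI domain 0000 (the thread only gives 01:00.0), that BAR0 is a 32-bit memory BAR, and that the sysfs config file is readable; all function names are hypothetical.

```python
import struct
import time

BAR0_OFFSET = 0x10  # BAR0 is at offset 0x10 of the standard config header


def read_bar0(config):
    """Return the raw 32-bit BAR0 value from config-space bytes."""
    return struct.unpack_from("<I", config, BAR0_OFFSET)[0]


def bar0_looks_reset(config):
    """True if the BAR0 base-address bits are all zero."""
    # For a memory BAR the low 4 bits are type/prefetch flags, not address
    # bits, so they are masked off before comparing (assumes a memory BAR).
    return (read_bar0(config) & ~0xF) == 0


def watch(bdf="0000:01:00.0", interval=60.0):
    """Poll BAR0 through sysfs and report when it first reads as reset."""
    path = f"/sys/bus/pci/devices/{bdf}/config"  # domain 0000 is assumed
    while True:
        with open(path, "rb") as f:
            config = f.read(64)  # standard header only
        if bar0_looks_reset(config):
            print(f"{time.ctime()}: BAR0 = {read_bar0(config):#010x}")
            return
        time.sleep(interval)
```

Note that each poll is itself a config read, so with the instrumented accessors it will show up in the kernel log; a long interval keeps that noise low. The wall-clock time printed by watch() can then be matched against the dmesg timestamps.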
>
> > Additionally, I noticed that the initial bootup logs with the
> > "0.000000" timestamp are missing in the dmesg log with this new
> > kernel. I'm unsure what might be causing this issue.
>
> Probably overflowed the message buffer. You can try increasing the
> buffer size:
The kern.log file is also missing those initial boot messages. If the
message buffer had overflowed and dmesg lost those entries, shouldn't
they still appear in kern.log? However, after this latest reboot I got
the complete log without increasing the buffer size.
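To check quickly whether a saved log really starts at boot, something like the sketch below could be run over dmesg output or kern.log. It assumes the usual "[ seconds.micros]" timestamp format, searches for it anywhere in the line so that a syslog prefix does not matter, and uses function names that are purely illustrative.

```python
import re

# Kernel timestamp such as "[    0.000000]" (seconds since boot).
# Not anchored to the line start, so syslog-prefixed lines also match.
_TS = re.compile(r"\[\s*(\d+\.\d+)\]")


def first_timestamp(lines):
    """Return the first kernel timestamp found in the log, or None."""
    for line in lines:
        m = _TS.search(line)
        if m:
            return float(m.group(1))
    return None


def starts_at_boot(lines, tolerance=0.0):
    """True if the earliest logged timestamp is within tolerance of 0."""
    ts = first_timestamp(lines)
    return ts is not None and ts <= tolerance
```

For example, starts_at_boot(open("dmesg_march3rd_log.txt")) returning False would indicate that the earliest messages are missing from that file.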
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/admin-guide/kernel-parameters.txt?id=v6.13#n3190
>
> You can also experiment with the dyndbg parameter to be more selective
> about the ACPI messages if some aren't useful.
>
> Bjorn
View attachment "dmesg_march3rd_log.txt" of type "text/plain" (100479 bytes)
Download attachment "0002-instrumented-the-PCI-config-accessors-to-log-all-the.patch" of type "application/octet-stream" (1523 bytes)
Download attachment "0001-added-more-debug-logs-in-hotplug_event-acpiphp_check.patch" of type "application/octet-stream" (3781 bytes)