Message-ID: <cv67hkaimtzsok7ryrzup3ql7unsizw2vix5nanx252pqblifv@42d6eibemsvx>
Date: Mon, 4 Aug 2025 09:47:21 -0700
From: Breno Leitao <leitao@...ian.org>
To: Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@...ux.intel.com>
Cc: Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>, Bjorn Helgaas <bhelgaas@...gle.com>,
Jon Pan-Doh <pandoh@...gle.com>, linuxppc-dev@...ts.ozlabs.org, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH] PCI/AER: Check for NULL aer_info before ratelimiting in
pci_print_aer()
Hello Sathyanarayanan,
On Mon, Aug 04, 2025 at 09:11:27AM -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 8/4/25 8:35 AM, Breno Leitao wrote:
> > Hello Sathyanarayanan,
> >
> > On Mon, Aug 04, 2025 at 06:50:30AM -0700, Sathyanarayanan Kuppuswamy wrote:
> > > On 8/4/25 2:17 AM, Breno Leitao wrote:
> > > > Similarly to pci_dev_aer_stats_incr(), pci_print_aer() may be called
> > > > when dev->aer_info is NULL. Add a NULL check before proceeding to avoid
> > > > calling aer_ratelimit() with a NULL aer_info pointer, returning 1, which
> > > > does not rate limit, given this is fatal.
> > > Why not add it to pci_print_aer() ?
> > >
> > > > This prevents a kernel crash triggered by dereferencing a NULL pointer
> > > > in aer_ratelimit(), ensuring safer handling of PCI devices that lack
> > > > AER info. This change aligns pci_print_aer() with pci_dev_aer_stats_incr()
> > > > which already performs this NULL check.
> > > Is this happening during the kernel boot ? What is the frequency and steps
> > > to reproduce? I am curious about why pci_print_aer() is called for a PCI device
> > > without aer_info. No aer_info means that particular device is already released
> > > or in the process of release (pci_release_dev()). Is this triggered by using a stale
> > > pci_dev pointer?
> > I've reported some of these investigations in here:
> >
> > https://lore.kernel.org/all/buduna6darbvwfg3aogl5kimyxkggu3n4romnmq6sozut6axeu@clnx7sfsy457/
>
> It has some details. But you did not mention details like your environment, steps to
> reproduce and how often you see it. I just want to understand in what scenario
> pci_print_aer() is triggered, when releasing the device. I am wondering whether we
> are missing proper locking somewhere.
Oh, unfortunately I don't have these details.

I have a bunch of machines in "prod" running 6.16, and they crash from
time to time, and then I have the crashdumps.

I can get anything that a crashdump provides, but I don't have
a reproducer or the exact steps that trigger it.

If I can get this information from a crashdump, I am more than happy to
investigate. Can we get this information from a crashdump?
Thanks,
--breno