[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251001083326.GC15786@p1gen4-pw042f0m.boeblingen.de.ibm.com>
Date: Wed, 1 Oct 2025 10:33:26 +0200
From: Benjamin Block <bblock@...ux.ibm.com>
To: Farhan Ali <alifm@...ux.ibm.com>
Cc: linux-s390@...r.kernel.org, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
alex.williamson@...hat.com, helgaas@...nel.org, clg@...hat.com,
schnelle@...ux.ibm.com, mjrosato@...ux.ibm.com
Subject: Re: [PATCH v4 02/10] PCI: Add additional checks for flr reset
On Tue, Sep 30, 2025 at 10:04:25AM -0700, Farhan Ali wrote:
>
> On 9/30/2025 3:03 AM, Benjamin Block wrote:
> > On Wed, Sep 24, 2025 at 10:16:20AM -0700, Farhan Ali wrote:
> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >> index a3d93d1baee7..327fefc6a1eb 100644
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
--8<--
> >> + if (pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, ®)) {
> >> + pci_warn(dev, "Device unable to do an FLR\n");
> >> + return -ENOTTY;
> >> + }
> > Just thinking out loud, not sure whether it make sense, but since you already
> > read an up-to-date value from the config space, would it make sense to
> > pull the check above `dev->devcap & PCI_EXP_DEVCAP_FLR` below this read, and
> > check on the just read `reg`?
>
> My thinking was we could exit early if the device never had FLR
> capability (and so was not cached in devcap). This way we avoid an extra
> PCI read.
That makes sense.
> > Also wondering whether it makes sense to stable-tag this? We've recently seen
> > "unpleasant" recovery attempts that look like this in the kernel logs:
> >
> > [ 663.330053] vfio-pci 0007:00:00.1: timed out waiting for pending transaction; performing function level reset anyway
> > [ 664.730051] vfio-pci 0007:00:00.1: not ready 1023ms after FLR; waiting
> > [ 665.830023] vfio-pci 0007:00:00.1: not ready 2047ms after FLR; waiting
> > [ 667.910023] vfio-pci 0007:00:00.1: not ready 4095ms after FLR; waiting
> > [ 672.070022] vfio-pci 0007:00:00.1: not ready 8191ms after FLR; waiting
> > [ 680.550025] vfio-pci 0007:00:00.1: not ready 16383ms after FLR; waiting
> > [ 697.190023] vfio-pci 0007:00:00.1: not ready 32767ms after FLR; waiting
> > [ 730.470021] vfio-pci 0007:00:00.1: not ready 65535ms after FLR; giving up
> >
> > The VF here was already dead in the water at that point, so I suspect
> > `pci_read_config_dword()` might have failed, and so this check would have
> > failed, and we wouldn't have "wasted" the minute waiting for a FLR that was
> > never going to happen anyway.
>
> I think maybe we could? I don't think this patch fixes anything that's
> "broken" but rather improves the behavior to escalate to other reset
> method if the device is already in a bad state. I will cc stable.
Right, adding a Fixes tag doesn't really make sense. But it does help with
accelerating recoveries, so maybe a stable tag will work :)
--
Best Regards, Benjamin Block / Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH / https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt / Geschäftsführung: David Faller
Sitz der Ges.: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Powered by blists - more mailing lists