[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aEg2vzf6tn4j96LG@black.fi.intel.com>
Date: Tue, 10 Jun 2025 16:44:31 +0300
From: Raag Jadav <raag.jadav@...el.com>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: Mario Limonciello <superm1@...nel.org>,
Denis Benato <benato.denis96@...il.com>, mahesh@...ux.ibm.com,
oohall@...il.com, bhelgaas@...gle.com, linux-pci@...r.kernel.org,
linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
ilpo.jarvinen@...ux.intel.com, lukas@...ner.de,
aravind.iddamsetty@...ux.intel.com,
"amd-gfx@...ts.freedesktop.org" <amd-gfx@...ts.freedesktop.org>,
Alex Deucher <alexander.deucher@....com>
Subject: Re: [PATCH v4] PCI: Prevent power state transition of erroneous
device
On Thu, Jun 05, 2025 at 02:26:05PM +0200, Rafael J. Wysocki wrote:
> On Thu, Jun 5, 2025 at 1:44 PM Raag Jadav <raag.jadav@...el.com> wrote:
> > On Wed, Jun 04, 2025 at 08:19:34PM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Jun 4, 2025 at 5:43 PM Raag Jadav <raag.jadav@...el.com> wrote:
> > > > On Fri, May 30, 2025 at 07:49:26PM +0200, Rafael J. Wysocki wrote:
> > > > > On Fri, May 30, 2025 at 7:23 PM Raag Jadav <raag.jadav@...el.com> wrote:
> > > > > > On Fri, May 23, 2025 at 05:23:10PM +0200, Rafael J. Wysocki wrote:
> > > > > > > On Wed, May 21, 2025 at 1:27 PM Rafael J. Wysocki <rafael@...nel.org> wrote:
> > > > > > > > On Wed, May 21, 2025 at 10:54 AM Raag Jadav <raag.jadav@...el.com> wrote:
> > > > > > > > > On Tue, May 20, 2025 at 01:56:28PM -0500, Mario Limonciello wrote:
> > > > > > > > > > On 5/20/2025 1:42 PM, Raag Jadav wrote:
> > > > > > > > > > > On Tue, May 20, 2025 at 12:39:12PM -0500, Mario Limonciello wrote:
> > > >
> > > > ...
> > > >
> > > > > > > > > > From the driver perspective it does have expectations that the parts outside
> > > > > > > > > > the driver did the right thing. If the driver was expecting the root port
> > > > > > > > > > to be powered down at suspend and it wasn't there are hardware components
> > > > > > > > > > that didn't power cycle and that's what we're seeing here.
> > > > > > > > >
> > > > > > > > > Which means the expectation set by the driver is the opposite of the
> > > > > > > > > purpose of this patch, and it's going to fail if any kind of error is
> > > > > > > > > detected under root port during suspend.
> > > > > > > >
> > > > > > > > And IMV this driver's expectation is questionable at least.
> > > > > > > >
> > > > > > > > There is no promise whatsoever that the device will always be put into
> > > > > > > > D3cold during system suspend.
> > > > > > >
> > > > > > > For instance, user space may disable D3cold for any PCI device via the
> > > > > > > d3cold_allowed attribute in sysfs.
> > > > > > >
> > > > > > > If the driver cannot handle this, it needs to be fixed.
> > > > > >
> > > > > > Thanks for confirming. So should we consider this patch to be valid
> > > > > > and worth moving forward?
> > > > >
> > > > > It doesn't do anything that would be invalid in principle IMV.
> > > > >
> > > > > You need to consider one more thing, though: It may be necessary to
> > > > > power-cycle the device in order to kick it out of the erroneous state
> > > > > and the patch effectively blocks this if I'm not mistaken.
> > > > >
> > > > > But admittedly I'm not sure if this really matters.
> > > >
> > > > Wouldn't something like bus reset (SBR) be more predictable?
> > >
> > > Maybe.
> > >
> > > The device state is most likely inconsistent in that case, so it depends.
> >
> > My limited understanding is that if SBR doesn't help, at that point all
> > bets are off including PMCSR configuration and probably a cold boot is
> > needed.
>
> I'm not talking about PMCSR, I'm talking about power removal (D3cold).
> This should be equivalent to a cold boot for the particular device
> except that cold boot would also reset the hierarchy above it.
Sure. But for D3cold we rely on everything else under root port to also
be suspended, which we can't predict while in endpoint suspend path. And
with that we're back to "no guarantees" problem, which was the case either
way. The patch might just prevent any further damage than what's already
done.
Raag
Powered by blists - more mailing lists