[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aKaK4WS0pY0Nb2yi@google.com>
Date: Wed, 20 Aug 2025 19:56:33 -0700
From: Brian Norris <briannorris@...omium.org>
To: Ethan Zhao <etzhao1900@...il.com>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>, linux-kernel@...r.kernel.org,
linux-pci@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
On Thu, Aug 21, 2025 at 08:54:52AM +0800, Ethan Zhao wrote:
> On 8/21/2025 1:26 AM, Brian Norris wrote:
> > From: Brian Norris <briannorris@...gle.com>
> >
> > max_link_speed, max_link_width, current_link_speed, current_link_width,
> > secondary_bus_number, and subordinate_bus_number all access config
> > registers, but they don't check the runtime PM state. If the device is
> > in D3cold, we may see -EINVAL or even bogus values.
> My understanding, if your device is in D3cold, returning of -EINVAL is
> the right behavior.
That's not the guaranteed result though. Some hosts don't properly
return PCIBIOS_DEVICE_NOT_FOUND, for one. But also, it's racy -- because
we don't even try to hold a pm_runtime reference, the device could
possibly enter D3cold while we're in the middle of reading from it. If
you're lucky, that'll get you a completion timeout and an all-1's
result, and we'll return a garbage result.
So if we want to purposely not resume the device and retain "I can't
give you what you asked for" behavior, we'd at least need a
pm_runtime_get_noresume() or similar.
> > Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
> > rest of the similar sysfs attributes.
> >
> > Fixes: 56c1af4606f0 ("PCI: Add sysfs max_link_speed/width, current_link_speed/width, etc")
> > Cc: stable@...r.kernel.org
> > Signed-off-by: Brian Norris <briannorris@...gle.com>
> > Signed-off-by: Brian Norris <briannorris@...omium.org>
> > ---
> >
> > drivers/pci/pci-sysfs.c | 32 +++++++++++++++++++++++++++++---
> > 1 file changed, 29 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > index 5eea14c1f7f5..160df897dc5e 100644
> > --- a/drivers/pci/pci-sysfs.c
> > +++ b/drivers/pci/pci-sysfs.c
> > @@ -191,9 +191,16 @@ static ssize_t max_link_speed_show(struct device *dev,
> > struct device_attribute *attr, char *buf)
> > {
> > struct pci_dev *pdev = to_pci_dev(dev);
> > + ssize_t ret;
> > +
> > + pci_config_pm_runtime_get(pdev);
> This function would potentially change the power state of device,
> that would be a complex process, beyond the meaning of
> max_link_speed_show(), given the semantics of these functions (
> max_link_speed_show()/max_link_width_show()/current_link_speed_show()/
> ....),
> this cannot be done !
What makes this different than the 'config' attribute (i.e., "read
config register")? Why shouldn't that just return -EINVAL? I don't
really buy your reasoning -- "it's a complex process" is not a reason
not to do something. The user asked for the link speed; why not give it?
If the user wanted to know if the device was powered, they could check
the 'power_state' attribute instead.
(Side note: these attributes don't show up anywhere in Documentation/,
so it's also a bit hard to declare "best" semantics for them.)
To flip this question around a bit: if I have a system that aggressively
suspends devices when there's no recent activity, how am I supposed to
check what the link speed is? Probabilistically hammer the file while
hoping some other activity wakes the device, so I can find the small
windows of time where it's RPM_ACTIVE? Disable runtime_pm for the device
while I check?
Brian
Powered by blists - more mailing lists