[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <da46b882-0cd3-48cd-b4fc-b118b25e1e7e@gmail.com>
Date: Fri, 22 Aug 2025 09:11:25 +0800
From: Ethan Zhao <etzhao1900@...il.com>
To: Brian Norris <briannorris@...omium.org>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>, linux-kernel@...r.kernel.org,
linux-pci@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
On 8/21/2025 11:28 PM, Brian Norris wrote:
> Hi Ethan,
>
> Note: I'm having a hard time reading your emails sometimes, because you
> aren't really adding in appropriate newlines that separate your reply
> from quoted text. So your own sentences just run together with parts of
> my sentences at times. I've tried to resolve this as best I can.
>
> On Thu, Aug 21, 2025 at 08:41:28PM +0800, Ethan Zhao wrote:
>>
>>
>> On 8/21/2025 10:56 AM, Brian Norris wrote:
>>> On Thu, Aug 21, 2025 at 08:54:52AM +0800, Ethan Zhao wrote:
>>>> On 8/21/2025 1:26 AM, Brian Norris wrote:
>>>>> From: Brian Norris <briannorris@...gle.com>
>>>>>
>>>>> max_link_speed, max_link_width, current_link_speed, current_link_width,
>>>>> secondary_bus_number, and subordinate_bus_number all access config
>>>>> registers, but they don't check the runtime PM state. If the device is
>>>>> in D3cold, we may see -EINVAL or even bogus values.
>>>> My understanding, if your device is in D3cold, returning of -EINVAL is
>>>> the right behavior.
>>>
>>> That's not the guaranteed result though. Some hosts don't properly
>>> return PCIBIOS_DEVICE_NOT_FOUND, for one. But also, it's racy -- because
>>> we don't even try to hold a pm_runtime reference, the device could
>>> possibly enter D3cold while we're in the middle of reading from it. If
>>> you're lucky, that'll get you a completion timeout and an all-1's
>>> result, and we'll return a garbage result.
>>>
>>> So if we want to purposely not resume the device and retain "I can't
>>> give you what you asked for" behavior, we'd at least need a
>>> pm_runtime_get_noresume() or similar.
>> I understand you just want the stable result of these caps,
>
> Yes, I'd like a valid result, not EINVAL. Why would I check this file if
> I didn't want the result?
>
>> meanwhile
>> you don't want the side effect either.
>
> Personally, I think side effect is completely fine. Or, it's just as
> fine as it is for the 'config' attribute or for 'resource_N_size'
> attributes that already do the same.
>
>>>>> Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
>>>>> rest of the similar sysfs attributes.
>>>>>
>>>>> Fixes: 56c1af4606f0 ("PCI: Add sysfs max_link_speed/width, current_link_speed/width, etc")
>>>>> Cc: stable@...r.kernel.org
>>>>> Signed-off-by: Brian Norris <briannorris@...gle.com>
>>>>> Signed-off-by: Brian Norris <briannorris@...omium.org>
>>>>> ---
>>>>>
>>>>> drivers/pci/pci-sysfs.c | 32 +++++++++++++++++++++++++++++---
>>>>> 1 file changed, 29 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>>>>> index 5eea14c1f7f5..160df897dc5e 100644
>>>>> --- a/drivers/pci/pci-sysfs.c
>>>>> +++ b/drivers/pci/pci-sysfs.c
>>>>> @@ -191,9 +191,16 @@ static ssize_t max_link_speed_show(struct device *dev,
>>>>> struct device_attribute *attr, char *buf)
>>>>> {
>>>>> struct pci_dev *pdev = to_pci_dev(dev);
>>>>> + ssize_t ret;
>>>>> +
>>>>> + pci_config_pm_runtime_get(pdev);
>>>> This function would potentially change the power state of device,
>>>> that would be a complex process, beyond the meaning of
>>>> max_link_speed_show(), given the semantics of these functions (
>>>> max_link_speed_show()/max_link_width_show()/current_link_speed_show()/
>>>> ....),
>>>> this cannot be done !
>>>
>>> What makes this different than the 'config' attribute (i.e., "read
>>> config register")? Why shouldn't that just return -EINVAL? I don't
>>> really buy your reasoning -- "it's a complex process" is not a reason
>> It is a reason to know there is side effect to be taken into account.
>
> OK, agreed, there's a side effect. I don't think you've convinced me the
> side effect is bad though.
>
>>> not
>>> to do something. The user asked for the link speed; why not give it?
>>> If the user wanted to know if the device was powered, they could check
>>> the 'power_state' attribute instead.
>>>
>>> (Side note: these attributes don't show up anywhere in Documentation/,
>>> so it's also a bit hard to declare "best" semantics for them.)
>>>
>>> To flip this question around a bit: if I have a system that aggressively
>>> suspends devices when there's no recent activity, how am I supposed to
>>> check what the link speed is? Probabilistically hammer the file while
>>> hoping some other activity wakes the device, so I can find the small
>>> windows of time where it's RPM_ACTIVE? Disable runtime_pm for the device
>>> while I check?
>> Hold a PM reference by pci_config_pm_runtime_get() and then write some
>> data to the PCIe config space, no objection.
>>
>> To know about the linkspeed etc capabilities/not status, how about
>> creating a cached version of these caps, no need to change their
>> power state.
>
> For static values like the "max" attributes, maybe that's fine.
>
> But Linux is not always the one changing the link speed. I've seen PCI
> devices that autonomously request link-speed changes, and AFAICT, the
> only way we'd know in host software is to go reread the config
> registers. So caching just produces cache invalidation problems.
Maybe you meant the link-speed status, that would be volatile based on
link retraining.
Here we are talking about some non-volatile capabilities value no
invalidation needed to their cached variables.>
>> If there is aggressive power saving requirement, and the polling
>> of these caps will make up wakeup/poweron bugs.
>
> If you're worried about wakeup frequency, I think that's a matter of
> user space / system administraction to decide -- if it doesn't want to
> potentially wake up the link, it shouldn't be poking at config-based
IMHO, sysfs interface is part of KABI, you change its behavior , you
definitely would break some running binaries. there is alternative
way to avoid re-cooking binaries or waking up administrator to modify
their configuration/script in the deep night. you already got it.
Thanks,
Ethan > sysfs attributes.
>
> Brian
Powered by blists - more mailing lists