[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20ac3359-cc0d-7725-fc75-d415bcd4c2fe@amd.com>
Date: Mon, 19 Jun 2023 11:16:35 -0500
From: "Limonciello, Mario" <mario.limonciello@....com>
To: Kai-Heng Feng <kai.heng.feng@...onical.com>,
Bjorn Helgaas <helgaas@...nel.org>
Cc: bhelgaas@...gle.com,
Mika Westerberg <mika.westerberg@...ux.intel.com>,
Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@...ux.intel.com>,
Vidya Sagar <vidyas@...dia.com>,
Michael Bottini <michael.a.bottini@...ux.intel.com>,
"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] PCI/ASPM: Enable ASPM on external PCIe devices
On 6/15/2023 10:01 PM, Kai-Heng Feng wrote:
> On Fri, Jun 16, 2023 at 1:12 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
>> On Thu, Jun 15, 2023 at 03:04:20PM +0800, Kai-Heng Feng wrote:
>>> When a PCIe device is hotplugged to a Thunderbolt port, ASPM is not
>>> enabled for that device. However, when the device is plugged preboot,
>>> ASPM is enabled by default.
>>>
>>> The disparity happens because BIOS doesn't have the ability to program
>>> ASPM on hotplugged devices.
>>>
>>> So enable ASPM by default for external connected PCIe devices so ASPM
>>> settings are consitent between preboot and hotplugged.
>>>
>>> On HP Thunderbolt Dock G4, enable ASPM can also fix BadDLLP error:
>>> pcieport 0000:00:1d.0: AER: Corrected error received: 0000:07:04.0
>>> pcieport 0000:07:04.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
>>> pcieport 0000:07:04.0: device [8086:0b26] error status/mask=00000080/00002000
>>> pcieport 0000:07:04.0: [ 7] BadDLLP
>>>
>>> The root cause is still unclear, but quite likely because the I225 on
>>> the dock supports PTM, where ASPM timing is precalculated for the PTM.
>>>
>>> Cc: Mario Limonciello <mario.limonciello@....com>
>>> Cc: Mika Westerberg <mika.westerberg@...ux.intel.com>
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217557
>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@...onical.com>
>>> ---
>>> drivers/pci/pcie/aspm.c | 4 +++-
>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
>>> index 66d7514ca111..613b0754c9bb 100644
>>> --- a/drivers/pci/pcie/aspm.c
>>> +++ b/drivers/pci/pcie/aspm.c
>>> @@ -119,7 +119,9 @@ static int policy_to_aspm_state(struct pcie_link_state *link)
>>> /* Enable Everything */
>>> return ASPM_STATE_ALL;
>>> case POLICY_DEFAULT:
>>> - return link->aspm_default;
>>> + return dev_is_removable(&link->downstream->dev) ?
>>> + link->aspm_capable :
>>> + link->aspm_default;
>> I'm a little hesitant because dev_is_removable() is a convenient test
>> that covers the current issue, but it doesn't seem tightly connected
>> from a PCIe architecture perspective.
>>
>> I think the current model of compile-time ASPM policy selection:
>>
>> CONFIG_PCIEASPM_DEFAULT /* BIOS default setting */
>> CONFIG_PCIEASPM_PERFORMANCE /* disable L0s and L1 */
>> CONFIG_PCIEASPM_POWERSAVE /* enable L0s and L1 */
>> CONFIG_PCIEASPM_POWER_SUPERSAVE /* enable L1 substates */
>>
>> is flawed. As far as I know, there's no technical reason we have to
>> select this at kernel build-time. I suspect the original reason was
>> risk avoidance, i.e., we were worried that we might expose hardware
>> defects if we enabled ASPM states that BIOS hadn't already enabled.
>>
>> How do we get out of that model? We do have sysfs knobs that should
>> cover all the functionality (set overall policy as above via
>> /sys/module/pcie_aspm/parameters/policy; set device-level exceptions
>> via /sys/bus/pci/devices/.../link/*_aspm).
> Agree. Build-time policy can be obsoleted by boot-time argument.
I agree as well.
>
>> In my opinion, the cleanest solution would be to enable all ASPM
>> functionality whenever possible and let users disable it if they need
>> to for performance. If there are device defects when something is
>> enabled, deal with it via quirks, as we do for other PCI features.
>>
>> That feels a little risky, but let's have a conversation about where
>> we want to go in the long term. It's good to avoid risk, but too much
>> avoidance leads to its own complexity and an inability to change
>> things.
> I think we should separate the situation into two cases:
> - When BIOS/system firmware has the ability to program ASPM, honor it.
> This applies to most "internal" PCI devices.
> - When BIOS/system can't program ASPM, enable ASPM for whatever it's
> capable of. Most notable case is Intel VMD controller, and this patch
> for devices connected through TBT.
>
> Enabling all ASPM functionality regardless of what's being
> pre-programmed by BIOS is way too risky.
> Disabling ASPM to workaround issues and defects are still quite common
> among hardware manufacturers.
I think the pragmatic way to approach it is to
(essentially) apply the policy as BIOS defaults and
allow overrides from that.
1) Leave a kernel command line option that will force it
off as a "big hammer" for problems.
Remove other ASPM related kernel command line options.
2) Direct users to use the sysfs knobs to override on a
case by case basis for devices to aide in development of
any quirks or new policies.
3) Add quirks and for things like VMD.
The external devices I don't think we should have a blanket
policy. What happens when you put a faulty device into a
USB4 PCIe enclosure? I think this will be a recipe for system
instability.
If the i225 PCIe device generally contained with USB4 docks
is known to behave well with ASPM, how about just quirking it like
VMD would be quirked?
Powered by blists - more mailing lists