Message-ID: <ca2b8c13-a7d8-e157-fd1a-770ee8cb10c1@amd.com>
Date: Wed, 17 Jul 2024 14:14:27 -0700
From: Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
To: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
Cc: Lukas Wunner <lukas@...ner.de>, Bjorn Helgaas <helgaas@...nel.org>,
linux-pci@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Yazen Ghannam <yazen.ghannam@....com>, Bowman Terry <terry.bowman@....com>,
Hagan Billy <billy.hagan@....com>, Simon Guinot <simon.guinot@...gate.com>,
"Maciej W . Rozycki" <macro@...am.me.uk>,
Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>
Subject: Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link
speed reduction
On 7/9/2024 3:52 AM, Ilpo Järvinen wrote:
> On Tue, 25 Jun 2024, Smita Koralahalli wrote:
>
>> Sorry for the delay here. Took some time to find a system to run experiments.
>> Comments inline.
>>
>> On 6/19/2024 12:47 AM, Lukas Wunner wrote:
>>> On Tue, Jun 18, 2024 at 02:23:21PM -0700, Smita Koralahalli wrote:
>>>> On 6/18/2024 11:51 AM, Smita Koralahalli wrote:
>>>>>>>> But IIUC LBMS is set by hardware but never cleared by hardware, so
>>>>>>>> if
>>>>>>>> we remove a device and power off the slot, it doesn't seem like
>>>>>>>> LBMS
>>>>>>>> could be telling us anything useful (what could we do in response
>>>>>>>> to
>>>>>>>> LBMS when the slot is empty?), so it makes sense to me to clear
>>>>>>>> it.
>>>>>>>>
>>>>>>>> It seems like pciehp_unconfigure_device() does sort of PCI core
>>>>>>>> and
>>>>>>>> driver-related things and possibly could be something shared by
>>>>>>>> all
>>>>>>>> hotplug drivers, while remove_board() does things more specific to
>>>>>>>> the
>>>>>>>> hotplug model (pciehp, shpchp, etc).
>>>>>>>>
>>>>>>>> From that perspective, clearing LBMS might fit better in
>>>>>>>> remove_board(). In that case, I wonder whether it should be done
>>>>>>>> after turning off slot power? This patch clears it *before*
>>>>>>>> turning off the power, so I wonder if hardware could possibly
>>>>>>>> set it again before the poweroff?
>>>>
>>>> While clearing LBMS in remove_board() here:
>>>>
>>>> 	if (POWER_CTRL(ctrl)) {
>>>> 		pciehp_power_off_slot(ctrl);
>>>> +		pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>>>> +					   PCI_EXP_LNKSTA_LBMS);
>>>>
>>>> 		/*
>>>> 		 * After turning power off, we must wait for at least 1 second
>>>> 		 * before taking any action that relies on power having been
>>>> 		 * removed from the slot/adapter.
>>>> 		 */
>>>> 		msleep(1000);
>>>>
>>>> 		/* Ignore link or presence changes caused by power off */
>>>> 		atomic_and(~(PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC),
>>>> 			   &ctrl->pending_events);
>>>> 	}
>>>>
>>>> This can happen too, right? I.e., just after the slot power-off and
>>>> before the LBMS clearing, PDC/PDSC could be fired. Then
>>>> pciehp_handle_presence_or_link_change() would hit case "OFF_STATE" and
>>>> proceed with pciehp_enable_slot() ... pcie_failed_link_retrain() and
>>>> ultimately the link speed drops.
>>>>
>>>> So, I added the clearing just before turning off the slot. Let me know
>>>> if I'm thinking it right.
>>
>> I guess I should have experimented before putting this comment out.
>>
>> After talking to the HW/FW teams, I understood that none of our CRBs
>> support a power controller for NVMe devices, which means the "Power
>> Controller Present" bit in Slot Capabilities is always false. That's
>> what makes it a "surprise removal." If the OS were notified beforehand
>> and there were a power controller attached, the OS would turn off the
>> power via Slot Control. That's an "orderly" removal. So essentially,
>> the entire "if (POWER_CTRL(ctrl))" block will never be executed for
>> surprise removal on our systems.
>>
>> There could be board designs other than ours with power controllers for
>> NVMe devices, which I'm not aware of.
>>>
>>> This was added by 3943af9d01e9 ("PCI: pciehp: Ignore Link State Changes
>>> after powering off a slot"). You can try reproducing it by writing "0"
>>> to the slot's "power" file in sysfs, but your hardware needs to support
>>> slot power.
>>>
>>> Basically the idea is that after waiting for 1 sec, chances are very low
>>> that any DLLSC or PDSC events caused by removing slot power may still
>>> occur.
>>
>> The PDSC events occurring in our case aren't caused by removing slot
>> power. They should/will always happen on a surprise removal, along with
>> DLLSC, for us. But this PDSC has been delayed and happens after DLLSC is
>> handled and ctrl->state = OFF_STATE in pciehp_disable_slot(). So the
>> PDSC is mistaken for a request to enable the slot in
>> pciehp_enable_slot() inside pciehp_handle_presence_or_link_change().
>>>
>>> Arguably the same applies to LBMS changes, so I'd recommend to likewise
>>> clear stale LBMS after the msleep(1000).
>>>
>>> pciehp_ctrl.c only contains the state machine and higher-level logic of
>>> the hotplug controller and all the actual register accesses are in helpers
>>> in pciehp_hpc.c. So if you want to do it picture-perfectly, add a helper
>>> in pciehp_hpc.c to clear LBMS and call that from remove_board().
>>>
>>> That all being said, I'm wondering how this plays together with Ilpo's
>>> bandwidth control driver?
>>>
>>> https://lore.kernel.org/all/20240516093222.1684-1-ilpo.jarvinen@linux.intel.com/
>>
>> I have yet to do a thorough reading of Ilpo's bandwidth control driver.
>> Ilpo, please correct me if I misspeak, as I don't have a thorough
>> understanding yet.
>>
>> Ilpo's bandwidth controller also checks for the LBMS count to be
>> greater than zero to bring down link speeds if CONFIG_PCIE_BWCTRL is
>> true. If false, it follows the default path of checking the LBMS bit in
>> the Link Status register. So if CONFIG_PCIE_BWCTRL is disabled by
>> default, we continue to see link speed drops. Even if BWCTRL is
>> enabled, the LBMS count is incremented to 1 in pcie_bwnotif_enable(),
>> so pcie_lbms_seen() might likely return true, thereby bringing down
>> speeds here as well if DLLLA is clear?
>
> I did add code to clear the LBMS count in pciehp_unconfigure_device() in
> part thanks to this patch of yours. Do you think it wouldn't work?
It works with BWCTRL enabled. My only concern would be to keep the
clearing in pciehp_unconfigure_device() and not do it inside the
POWER_CTRL(ctrl) block in remove_board(), as per the suggestions given
above.
>
> But I agree there would still be a problem if BWCTRL is not enabled. I
> already have to keep part of it enabled due to the Target Speed quirk,
> and now this is another case where just having it always on would be
> beneficial.
Correct, it should always be on so the problem isn't seen.
I would like to have this "LBMS clearing fix" accepted sooner, since it's
breaking things on our systems! :)
Thanks
Smita.
>
>>> IIUC, the bandwidth control driver will be in charge of handling LBMS
>>> changes. So clearing LBMS behind the bandwidth control driver's back
>>> might be problematic. Ilpo?
>
> Yes, BW controller will take control of LBMS and other code should not
> touch it directly (and LBMS will be kept cleared by the BW controller).
> However, in this case I'll just need to adapt the code to replace the
> LBMS clearing with resetting the LBMS count (if this patch is accepted
> before the BW controller); the resetting is already there anyway.
>
>>> Also, since you've confirmed that this issue is fallout from
>>> a89c82249c37 ("PCI: Work around PCIe link training failures"),
>>> I'm wondering if the logic introduced by that commit can be
>>> changed so that the quirk is applied more narrowly, i.e. *not*
>>> applied to unaffected hardware, such as AMD's hotplug ports.
>>> That would avoid the need to undo the effect of the quirk and
>>> work around the downtraining you're seeing.
>>>
>>> Maciej, any ideas?
>>
>> Yeah I'm okay to go down to that approach as well. Any ideas would be helpful
>> here.
>
> One thing I don't like in the Target Speed quirk is that it leaves the
> Link Speed at the lower value if the quirk fails to bring the link up;
> the quirk could restore the original Link Speed on failure to avoid
> these problems. I even suggested that earlier. However, the downside of
> restoring the original Link Speed is that it requires triggering yet
> another retraining (perhaps we could avoid waiting for its completion,
> though, since we expect it to fail).
>
> It might be possible to eventually trigger the Target Speed quirk from the
> BW controller but it would require writing some state machine so that the
> quirk is not repeatedly attempted. It seemed to complicate things too much
> to add such a state machine at this point.
>