netdev - Re: [net-next v2] net: wwan: t7xx: reset device if suspend fails

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1657857c-7761-4fea-aad2-f26c69473895@gmail.com>
Date: Mon, 4 Nov 2024 00:02:45 +0200
From: Sergey Ryazanov <ryazanov.s.a@...il.com>
To: linasvepstas@...il.com, Bjorn Helgaas <helgaas@...nel.org>
Cc: Jinjian Song <jinjian.song@...ocom.com>,
 chandrashekar.devegowda@...el.com, chiranjeevi.rapolu@...ux.intel.com,
 haijun.liu@...iatek.com, m.chetan.kumar@...ux.intel.com,
 ricardo.martinez@...ux.intel.com, loic.poulain@...aro.org,
 johannes@...solutions.net, davem@...emloft.net, edumazet@...gle.com,
 kuba@...nel.org, pabeni@...hat.com, angelogioacchino.delregno@...labora.com,
 bhelgaas@...gle.com, corbet@....net, danielwinkler@...gle.com,
 korneld@...gle.com, linux-arm-kernel@...ts.infradead.org,
 linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-mediatek@...ts.infradead.org, matthias.bgg@...il.com,
 netdev@...r.kernel.org
Subject: Re: [net-next v2] net: wwan: t7xx: reset device if suspend fails

Hello Bjorn, Linas,

many thanks for clarifying this tricky topic! Hope this information can 
serve a good starting point for Jinjian to develop a proper solution for 
that case.

On 03.11.2024 03:24, Linas Vepstas wrote:
> Top-post reply.
> 
> What Bjorn says is exactly right. Just one clarifying remark: When PCI
> error recovery was designed, it was envisioned for high-end,
> high-availability servers so that they could reset a failing device
> without forcing a reboot of the OS.  Concepts like suspend/resume did
> not perturb even the unconscious sleep of the engineers. Thus,
> stepping through error recovery during suspend would be novel,
> untested, and perhaps prone to confusion. The error recovery procedure
> is trying to reset the device into a fully-powered-on,
> fully-functional and connected state. This might be problematic if the
> suspend has already walked half-way through power-down. The fact that
> error recovery might run a very long fraction of a second after the
> error is detected further complicates things. The fact that error
> recovery  usually has the driver fully reinitializing the device,
> including reloading the firmware, could be a problem if the firmware
> is not in RAM or is sitting on a suspended block device. So it all
> could be an adventure.
> 
> As to testing: I asked one of the guys in the lab how he did it, and
> he said he would brush a metal wire against the PCI pins. I winced.
> But I guess it's cleaner than pouring coffee on it. Later we developed
> a software tool to artificially inject errors.
> 
> On Fri, Nov 1, 2024 at 8:52 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
>> On Thu, Oct 31, 2024 at 09:09:30PM +0800, Jinjian Song wrote:
>>> From: Sergey Ryazanov <ryazanov.s.a@...il.com>
>>>> On 29.10.2024 05:46, Jinjian Song wrote:
>>>>> From: Sergey Ryazanov <ryazanov.s.a@...il.com>
>>>>>> On 22.10.2024 11:43, Jinjian Song wrote:
>>>>>>> If driver fails to set the device to suspend, it means that the
>>>>>>> device is abnormal. In this case, reset the device to recover
>>>>>>> when PCIe device is offline.
>>>>>>
>>>>>> Is it a reproducible or a speculative issue? Does the fix
>>>>>> recover modem from a problematic state?
>>>>>>
>>>>>> Anyway we need someone more familiar with this hardware (Intel
>>>>>> or MediaTek engineer) to Ack the change to make sure we are not
>>>>>> going to put a system in a more complicated state.
>>>>>
>>>>> This is a very difficult issue to replicate onece occured and fixed.
>>>>>
>>>>> The issue occured when driver and device lost the connection. I have
>>>>> encountered this problem twice so far:
>>>>> 1. During suspend/resume stress test, there was a probabilistic D3L2
>>>>> time sequence issue with the BIOS, result in PCIe link down, driver
>>>>> read and write the register of device invalid, so suspend failed.
>>>>> This issue was eventually fixed in the BIOS and I was able to restore
>>>>> it through the reset module after reproducing the problem.
>>>>>
>>>>> 2. During idle test, the modem probabilistic hang up, result in PCIe
>>>>> link down, driver read and write the register of device invalid, so
>>>>> suspend failed. This issue was eventually fiex in device modem firmware
>>>>> by adjust a certain power supply voltage, and reset modem as a workround
>>>>> to restore when the MBIM port command timeout in userspace applycations.
>>>>>
>>>>> Hardware reset modem to recover was discussed with MTK, and they said
>>>>> that if we don't want to keep the on-site problem location in case of
>>>>> suspend failure, we can use the recover solution.
>>>>> Both the ocurred issues result in the PCIe link issue, driver can't
>>>>> read and writer the register of WWAN device, so I want to add this
>>>>> path
>>>>> to restore, hardware reset modem can recover modem, but using the
>>>>> pci_channle_offline() as the judgment is my inference.
>>>>
>>>> Thank you for the clarification. Let me summarize what I've understood
>>>> from the explanation:
>>>> a) there were hardware (firmware) issues,
>>>> b) issues already were solved,
>>>> c) issues were not directly related to the device suspension procedure,
>>>> d) you want to implement a backup plan to make the modem support robust.
>>>>
>>>> If got it right, then I would like to recommend to implement a generic
>>>> error handling solution for the PCIe interface. You can check this
>>>> document: Documentation/PCI/pci-error-recovery.rst
>>>
>>> Yes, got it right.
>>> I want to identify the scenario and then recover by reset device,
>>> otherwise suspend failure will aways prevent the system from suspending
>>> if it occurs.
>>
>> If a PCIe link goes down here's my understanding of what happens:
>>
>>    - Writes to the device are silently dropped.
>>
>>    - Reads from the device return ~0 (PCI_POSSIBLE_ERROR()).
>>
>>    - If the device is in a slot and pciehp is enabled, a Data Link
>>      Layer State Changed interrupt will cause pciehp_unconfigure_device()
>>      to detach the driver and remove the pci_dev.
>>
>>    - If AER is enabled, a failed access to the device will cause an AER
>>      interrupt.  If the driver has registered pci_error_handlers, the
>>      driver callbacks will be called, and based on what the driver
>>      returns, the PCI core may reset the device.
>>
>> The pciehp and AER interrupts are *asynchronous* to link down events
>> and to any driver access to the device, so they may be delayed an
>> arbitrary amount of time.
>>
>> Both interrupt paths may lead to the device being marked as "offline".
>> Obviously this is asynchronous with respect to the driver.
>>
>>>>>>> +++ b/drivers/net/wwan/t7xx/t7xx_pci.c
>>>>>>> @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct
>>>>>>> pci_dev *pdev)
>>>>>>>        iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) +
>>>>>>> ENABLE_ASPM_LOWPWR);
>>>>>>>        atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
>>>>>>>        t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
>>>>>>> +    if (pci_channel_offline(pdev)) {
>>>>>>> +        dev_err(&pdev->dev, "Device offline, reset to recover\n");
>>>>>>> +        t7xx_reset_device(t7xx_dev, PLDR);
>>>>>>> +    }
>>
>> This looks like an unreliable way to detect issues.  It only works if
>> AER is enabled, and the device is only marked "offline" some arbitrary
>> length of time *after* a driver access to the device has failed.
>>
>> You can't reliably detect errors on writes to the device.
>>
>> You can only reliably detect errors on reads from the device by
>> looking for PCI_POSSIBLE_ERROR().  Obviously ~0 might be a valid value
>> to read from some registers, so you need device-specific knowledge to
>> know whether ~0 is valid or indicates an error.
>>
>> If AER or DPC are enabled, the driver can be *notified* about read
>> errors and some write errors via pci_error_handlers, but the
>> notification is long after the error.

--
Sergey