netdev - Re: [net-next v2] net: wwan: t7xx: reset device if suspend fails

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHrUA36X2cMZ4WNtRO7NoLaupBNfecLLxDfSVZ4mZzbzHiDjRA@mail.gmail.com>
Date: Sat, 2 Nov 2024 20:24:10 -0500
From: Linas Vepstas <linasvepstas@...il.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: Jinjian Song <jinjian.song@...ocom.com>, ryazanov.s.a@...il.com, 
	chandrashekar.devegowda@...el.com, chiranjeevi.rapolu@...ux.intel.com, 
	haijun.liu@...iatek.com, m.chetan.kumar@...ux.intel.com, 
	ricardo.martinez@...ux.intel.com, loic.poulain@...aro.org, 
	johannes@...solutions.net, davem@...emloft.net, edumazet@...gle.com, 
	kuba@...nel.org, pabeni@...hat.com, angelogioacchino.delregno@...labora.com, 
	bhelgaas@...gle.com, corbet@....net, danielwinkler@...gle.com, 
	korneld@...gle.com, linux-arm-kernel@...ts.infradead.org, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-mediatek@...ts.infradead.org, matthias.bgg@...il.com, 
	netdev@...r.kernel.org
Subject: Re: [net-next v2] net: wwan: t7xx: reset device if suspend fails

Top-post reply.

What Bjorn says is exactly right. Just one clarifying remark: When PCI
error recovery was designed, it was envisioned for high-end,
high-availability servers so that they could reset a failing device
without forcing a reboot of the OS.  Concepts like suspend/resume did
not perturb even the unconscious sleep of the engineers. Thus,
stepping through error recovery during suspend would be novel,
untested, and perhaps prone to confusion. The error recovery procedure
is trying to reset the device into a fully-powered-on,
fully-functional and connected state. This might be problematic if the
suspend has already walked half-way through power-down. The fact that
error recovery might run a very long fraction of a second after the
error is detected further complicates things. The fact that error
recovery  usually has the driver fully reinitializing the device,
including reloading the firmware, could be a problem if the firmware
is not in RAM or is sitting on a suspended block device. So it all
could be an adventure.

As to testing: I asked one of the guys in the lab how he did it, and
he said he would brush a metal wire against the PCI pins. I winced.
But I guess it's cleaner than pouring coffee on it. Later we developed
a software tool to artificially inject errors.

--linas

On Fri, Nov 1, 2024 at 8:52 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
>
> On Thu, Oct 31, 2024 at 09:09:30PM +0800, Jinjian Song wrote:
> > From: Sergey Ryazanov <ryazanov.s.a@...il.com>
> > > On 29.10.2024 05:46, Jinjian Song wrote:
> > > > From: Sergey Ryazanov <ryazanov.s.a@...il.com>
> > > > > On 22.10.2024 11:43, Jinjian Song wrote:
> > > > > > If driver fails to set the device to suspend, it means that the
> > > > > > device is abnormal. In this case, reset the device to recover
> > > > > > when PCIe device is offline.
> > > > >
> > > > > Is it a reproducible or a speculative issue? Does the fix
> > > > > recover modem from a problematic state?
> > > > >
> > > > > Anyway we need someone more familiar with this hardware (Intel
> > > > > or MediaTek engineer) to Ack the change to make sure we are not
> > > > > going to put a system in a more complicated state.
> > > >
> > > > This is a very difficult issue to replicate onece occured and fixed.
> > > >
> > > > The issue occured when driver and device lost the connection. I have
> > > > encountered this problem twice so far:
> > > > 1. During suspend/resume stress test, there was a probabilistic D3L2
> > > > time sequence issue with the BIOS, result in PCIe link down, driver
> > > > read and write the register of device invalid, so suspend failed.
> > > > This issue was eventually fixed in the BIOS and I was able to restore
> > > > it through the reset module after reproducing the problem.
> > > >
> > > > 2. During idle test, the modem probabilistic hang up, result in PCIe
> > > > link down, driver read and write the register of device invalid, so
> > > > suspend failed. This issue was eventually fiex in device modem firmware
> > > > by adjust a certain power supply voltage, and reset modem as a workround
> > > > to restore when the MBIM port command timeout in userspace applycations.
> > > >
> > > > Hardware reset modem to recover was discussed with MTK, and they said
> > > > that if we don't want to keep the on-site problem location in case of
> > > > suspend failure, we can use the recover solution.
> > > > Both the ocurred issues result in the PCIe link issue, driver can't
> > > > read and writer the register of WWAN device, so I want to add this
> > > > path
> > > > to restore, hardware reset modem can recover modem, but using the
> > > > pci_channle_offline() as the judgment is my inference.
> > >
> > > Thank you for the clarification. Let me summarize what I've understood
> > > from the explanation:
> > > a) there were hardware (firmware) issues,
> > > b) issues already were solved,
> > > c) issues were not directly related to the device suspension procedure,
> > > d) you want to implement a backup plan to make the modem support robust.
> > >
> > > If got it right, then I would like to recommend to implement a generic
> > > error handling solution for the PCIe interface. You can check this
> > > document: Documentation/PCI/pci-error-recovery.rst
> >
> > Yes, got it right.
> > I want to identify the scenario and then recover by reset device,
> > otherwise suspend failure will aways prevent the system from suspending
> > if it occurs.
>
> If a PCIe link goes down here's my understanding of what happens:
>
>   - Writes to the device are silently dropped.
>
>   - Reads from the device return ~0 (PCI_POSSIBLE_ERROR()).
>
>   - If the device is in a slot and pciehp is enabled, a Data Link
>     Layer State Changed interrupt will cause pciehp_unconfigure_device()
>     to detach the driver and remove the pci_dev.
>
>   - If AER is enabled, a failed access to the device will cause an AER
>     interrupt.  If the driver has registered pci_error_handlers, the
>     driver callbacks will be called, and based on what the driver
>     returns, the PCI core may reset the device.
>
> The pciehp and AER interrupts are *asynchronous* to link down events
> and to any driver access to the device, so they may be delayed an
> arbitrary amount of time.
>
> Both interrupt paths may lead to the device being marked as "offline".
> Obviously this is asynchronous with respect to the driver.
>
> > > > > > +++ b/drivers/net/wwan/t7xx/t7xx_pci.c
> > > > > > @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct
> > > > > > pci_dev *pdev)
> > > > > >       iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) +
> > > > > > ENABLE_ASPM_LOWPWR);
> > > > > >       atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
> > > > > >       t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
> > > > > > +    if (pci_channel_offline(pdev)) {
> > > > > > +        dev_err(&pdev->dev, "Device offline, reset to recover\n");
> > > > > > +        t7xx_reset_device(t7xx_dev, PLDR);
> > > > > > +    }
>
> This looks like an unreliable way to detect issues.  It only works if
> AER is enabled, and the device is only marked "offline" some arbitrary
> length of time *after* a driver access to the device has failed.
>
> You can't reliably detect errors on writes to the device.
>
> You can only reliably detect errors on reads from the device by
> looking for PCI_POSSIBLE_ERROR().  Obviously ~0 might be a valid value
> to read from some registers, so you need device-specific knowledge to
> know whether ~0 is valid or indicates an error.
>
> If AER or DPC are enabled, the driver can be *notified* about read
> errors and some write errors via pci_error_handlers, but the
> notification is long after the error.
>
> Bjorn



-- 
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.