lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241031170428.GA1249507@bhelgaas>
Date: Thu, 31 Oct 2024 12:04:28 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: Jinjian Song <jinjian.song@...ocom.com>
Cc: ryazanov.s.a@...il.com, chandrashekar.devegowda@...el.com,
	chiranjeevi.rapolu@...ux.intel.com, haijun.liu@...iatek.com,
	m.chetan.kumar@...ux.intel.com, ricardo.martinez@...ux.intel.com,
	loic.poulain@...aro.org, johannes@...solutions.net,
	davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
	pabeni@...hat.com, angelogioacchino.delregno@...labora.com,
	bhelgaas@...gle.com, corbet@....net, danielwinkler@...gle.com,
	korneld@...gle.com, linasvepstas@...il.com,
	linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-mediatek@...ts.infradead.org,
	matthias.bgg@...il.com, netdev@...r.kernel.org
Subject: Re: [net-next v2] net: wwan: t7xx: reset device if suspend fails

On Thu, Oct 31, 2024 at 09:09:30PM +0800, Jinjian Song wrote:
> From: Sergey Ryazanov <ryazanov.s.a@...il.com>
> > On 29.10.2024 05:46, Jinjian Song wrote:
> > > From: Sergey Ryazanov <ryazanov.s.a@...il.com>
> > > > On 22.10.2024 11:43, Jinjian Song wrote:
> > > > > If driver fails to set the device to suspend, it means that the
> > > > > device is abnormal. In this case, reset the device to recover
> > > > > when PCIe device is offline.
> > > > 
> > > > Is it a reproducible or a speculative issue? Does the fix
> > > > recover modem from a problematic state?
> > > > 
> > > > Anyway we need someone more familiar with this hardware (Intel
> > > > or MediaTek engineer) to Ack the change to make sure we are not
> > > > going to put a system in a more complicated state.
> > > 
> > > This is a very difficult issue to replicate onece occured and fixed.
> > > 
> > > The issue occured when driver and device lost the connection. I have
> > > encountered this problem twice so far:
> > > 1. During suspend/resume stress test, there was a probabilistic D3L2
> > > time sequence issue with the BIOS, result in PCIe link down, driver
> > > read and write the register of device invalid, so suspend failed.
> > > This issue was eventually fixed in the BIOS and I was able to restore
> > > it through the reset module after reproducing the problem.
> > > 
> > > 2. During idle test, the modem probabilistic hang up, result in PCIe
> > > link down, driver read and write the register of device invalid, so
> > > suspend failed. This issue was eventually fiex in device modem firmware
> > > by adjust a certain power supply voltage, and reset modem as a workround
> > > to restore when the MBIM port command timeout in userspace applycations.
> > > 
> > > Hardware reset modem to recover was discussed with MTK, and they said
> > > that if we don't want to keep the on-site problem location in case of
> > > suspend failure, we can use the recover solution.
> > > Both the ocurred issues result in the PCIe link issue, driver can't
> > > read and writer the register of WWAN device, so I want to add this
> > > path
> > > to restore, hardware reset modem can recover modem, but using the
> > > pci_channle_offline() as the judgment is my inference.
> > 
> > Thank you for the clarification. Let me summarize what I've understood
> > from the explanation:
> > a) there were hardware (firmware) issues,
> > b) issues already were solved,
> > c) issues were not directly related to the device suspension procedure,
> > d) you want to implement a backup plan to make the modem support robust.
> > 
> > If got it right, then I would like to recommend to implement a generic
> > error handling solution for the PCIe interface. You can check this
> > document: Documentation/PCI/pci-error-recovery.rst
> 
> Yes, got it right.
> I want to identify the scenario and then recover by reset device,
> otherwise suspend failure will aways prevent the system from suspending
> if it occurs.

If a PCIe link goes down here's my understanding of what happens:

  - Writes to the device are silently dropped.

  - Reads from the device return ~0 (PCI_POSSIBLE_ERROR()).

  - If the device is in a slot and pciehp is enabled, a Data Link
    Layer State Changed interrupt will cause pciehp_unconfigure_device()
    to detach the driver and remove the pci_dev.

  - If AER is enabled, a failed access to the device will cause an AER
    interrupt.  If the driver has registered pci_error_handlers, the
    driver callbacks will be called, and based on what the driver
    returns, the PCI core may reset the device.

The pciehp and AER interrupts are *asynchronous* to link down events
and to any driver access to the device, so they may be delayed an
arbitrary amount of time.

Both interrupt paths may lead to the device being marked as "offline".
Obviously this is asynchronous with respect to the driver.

> > > > > +++ b/drivers/net/wwan/t7xx/t7xx_pci.c
> > > > > @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct
> > > > > pci_dev *pdev)
> > > > >       iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) +
> > > > > ENABLE_ASPM_LOWPWR);
> > > > >       atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
> > > > >       t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
> > > > > +    if (pci_channel_offline(pdev)) {
> > > > > +        dev_err(&pdev->dev, "Device offline, reset to recover\n");
> > > > > +        t7xx_reset_device(t7xx_dev, PLDR);
> > > > > +    }

This looks like an unreliable way to detect issues.  It only works if
AER is enabled, and the device is only marked "offline" some arbitrary
length of time *after* a driver access to the device has failed.

You can't reliably detect errors on writes to the device.

You can only reliably detect errors on reads from the device by
looking for PCI_POSSIBLE_ERROR().  Obviously ~0 might be a valid value
to read from some registers, so you need device-specific knowledge to
know whether ~0 is valid or indicates an error.

If AER or DPC are enabled, the driver can be *notified* about read
errors and some write errors via pci_error_handlers, but the
notification is long after the error.

Bjorn

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ