Message-ID: <20250207222352.GA949665@bhelgaas>
Date: Fri, 7 Feb 2025 16:23:52 -0600
From: Bjorn Helgaas <helgaas@...nel.org>
To: Marek Marczykowski-Górecki <marmarek@...isiblethingslab.com>
Cc: Jan Beulich <jbeulich@...e.com>, Bjorn Helgaas <bhelgaas@...gle.com>,
	Jürgen Groß <jgross@...e.com>,
	Roger Pau Monné <roger.pau@...rix.com>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	xen-devel <xen-devel@...ts.xenproject.org>,
	linux-kernel@...r.kernel.org, regressions@...ts.linux.dev,
	Felix Fietkau <nbd@....name>, Lorenzo Bianconi <lorenzo@...nel.org>,
	Ryder Lee <ryder.lee@...iatek.com>,
	Alex Williamson <alex.williamson@...hat.com>,
	Deren Wu <Deren.Wu@...iatek.com>,
	Kai-Heng Feng <kai.heng.feng@...onical.com>,
	Shayne Chen <Shayne.Chen@...iatek.com>,
	Sean Wang <Sean.Wang@...iatek.com>,
	Leon Yen <Leon.Yen@...iatek.com>,
	linux-mediatek@...ts.infradead.org
Subject: Re: Config space access to Mediatek MT7922 doesn't work after device
 reset in Xen PV dom0 (regression, Linux 6.12)

[+cc Alex, Mediatek folks, thread at https://lore.kernel.org/r/Z4pHll_6GX7OUBzQ@mail-itl]

On Wed, Feb 05, 2025 at 11:14:17PM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 30, 2025 at 03:31:23PM -0600, Bjorn Helgaas wrote:
> > On Thu, Jan 30, 2025 at 10:30:33AM +0100, Jan Beulich wrote:
> > > On 30.01.2025 05:55, Marek Marczykowski-Górecki wrote:
> > > > I've added logging of all config read/write to this device. Full log at
> > > > [1].
> > > ...

> ...  Generally it looks like this device has broken FLR, and the
> reset works due to the fallback to the secondary bus reset on
> timeout.

> I repeated the test with my additional "&& !PCI_POSSIBLE_ERROR(id)"
> and I got this:
> https://gist.github.com/marmarek/db0808702131b69ea2f66f339a55d71b

I'm having a really hard time piecing this all together.  I'm trying
to recap the current theory:

  - https://github.com/QubesOS/qubes-issues/issues/9689 reports that
    Mediatek MT7922 wifi (vendor/device [14c3:0616]) broke when
    running v6.12 on Xen.

  - Marek reproduced this and bisected to d591f6804e7e ("PCI: Wait for
    device readiness with Configuration RRS"), which appeared in
    v6.12.

  - We do FLR on the device, either via sysfs or the xen-pciback
    driver, e.g., pcistub_reset_device_state().

  - We theorize that FLR is unreliable on this device, and that after
    FLR the device may never respond successfully again.  All config
    reads, either of PCI_COMMAND (before d591f6804e7e) or of
    PCI_VENDOR_ID (after d591f6804e7e), return ~0.

  - Prior to d591f6804e7e, e.g., in v6.11, pci_dev_wait() polls
    PCI_COMMAND, always reads ~0, eventually times out, and returns
    -ENOTTY.

    Since -ENOTTY was returned, we try another reset method.  A
    Secondary Bus Reset (SBR) works, and the device works again.

    [3] seems to show this scenario ("NO BUG (kernel rollback 6.11)").
    We waited ~345 seconds before giving up.

  - After d591f6804e7e, e.g., in v6.12, pci_dev_wait() polls
    PCI_VENDOR_ID looking for anything other than 0x0001.  We
    immediately get 0xffff and exit the loop.  We assume the device is
    ready, but it's actually not.

    If pci_dbg were enabled (CONFIG_DYNAMIC_DEBUG=y and booted with
    dyndbg="file drivers/pci/* +p"), we should see "ready %dms after
    FLR" with a very small time.

    We mistakenly think the device is ready, so we restore config
    space, which the device ignores because it's not ready.  The
    device doesn't work at all, perhaps because its config space has
    not been restored.

  - After including the debug patch below, pci_dev_wait() polls
    PCI_VENDOR_ID for something other than either 0x0001 or 0xffff.

    This "works" the same as before d591f6804e7e: We always get ~0,
    eventually time out, return -ENOTTY, fall back to SBR, and the
    device works again.  Because of the timeout, it takes about 70
    seconds in both the Xen and the native logs in [4].

  - The initial report said this works on v6.12 after a warm reboot
    from v6.11, but fails after a cold boot [3].  Followup says this
    works on v6.12 running natively, but it fails when running on
    Xen [5].

    I can't explain why this works in some cases but not others.

  - It seems that even in v6.11, FLR didn't work for this device.  The
    device did eventually become usable, but only because we waited
    70+ seconds after FLR, timed out, and fell back to SBR.

    The quirk patch below should avoid use of FLR completely.  The
    mt7921 driver supports several other devices; maybe more of them
    should be added to the quirk.

    Web searches for mediatek "not ready" "after FLR" find many
    similar reports: [6], [7] (suspicious in that holding the power
    button for 60 seconds seems to fix something, maybe similar to
    the warm/cold reboot thing), [8] (works, then fails after
    suspend/resume), [9], [10].

[3] https://github.com/QubesOS/qubes-issues/issues/9689#issuecomment-2582927149
[4] https://gist.github.com/marmarek/db0808702131b69ea2f66f339a55d71b
[5] https://lore.kernel.org/r/Z4pHll_6GX7OUBzQ@mail-itl
[6] https://community.frame.work/t/responded-yet-more-mediatek-issues-on-amd-linux/50039
[7] https://www.linux.org/threads/solved-wifi-adaptator-not-found-mediatek-wi-fi-6-mt7921-wireless-lan-card.37699/page-2
[8] https://forum.manjaro.org/t/mediatek-mt7922-wifi-not-working-after-waking-up/160664
[9] https://forum.manjaro.org/t/mediatek-mt7921e-fails-in-kernel-6-6-and-later-through-6-10/164217
[10] https://www.reddit.com/r/archlinux/comments/188ccib/wifi_disabled_after_disconnected_power/
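
To make the fallback in the recap above concrete, here is a minimal
userspace sketch of "try FLR, and on -ENOTTY move on to SBR".  It is
not the kernel code: struct fake_dev, try_flr(), try_sbr() and
reset_device() are hypothetical names, and the flr_works flag is an
assumption standing in for "the wait after FLR timed out".  The no_flr
flag plays the role of a quirk that marks FLR as unusable.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device; none of these names are kernel API. */
struct fake_dev {
	bool no_flr;       /* set by a quirk: never attempt FLR */
	bool flr_works;    /* assumption: device comes back after FLR */
	int  flr_attempts; /* how many times FLR was actually issued */
	int  sbr_attempts; /* how many times SBR was actually issued */
};

static int try_flr(struct fake_dev *d)
{
	if (d->no_flr)
		return -ENOTTY;  /* method unusable, caller tries the next one */
	d->flr_attempts++;
	/* -ENOTTY here models the readiness wait timing out after FLR */
	return d->flr_works ? 0 : -ENOTTY;
}

static int try_sbr(struct fake_dev *d)
{
	d->sbr_attempts++;
	return 0;                /* in this model SBR always recovers the device */
}

/* Walk the reset methods in order; only -ENOTTY means "try the next one". */
static int reset_device(struct fake_dev *d)
{
	int (*methods[])(struct fake_dev *) = { try_flr, try_sbr };

	assert(d != NULL);
	for (size_t i = 0; i < sizeof(methods) / sizeof(methods[0]); i++) {
		int rc = methods[i](d);

		if (rc != -ENOTTY)
			return rc;
	}
	return -ENOTTY;
}
```

In this model a device with broken FLR and no quirk still ends up
reset via SBR, but only after the FLR attempt has failed.  With no_flr
set, the FLR attempt (and the long wait it implies on real hardware)
is skipped.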

Debug patch:

  @@ -1297,7 +1297,8 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
                  if (root && root->config_rrs_sv) {
                          pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
  -                     if (!pci_bus_rrs_vendor_id(id))
  +                     if (!pci_bus_rrs_vendor_id(id) &&
  +                         !PCI_POSSIBLE_ERROR(id))
                                  break;
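
For comparison, the three readiness checks discussed in this thread
can be put side by side in a small userspace sketch.  This is a
simplified model, not the kernel's pci_dev_wait(): is_rrs() and
is_error() are local stand-ins for pci_bus_rrs_vendor_id() and
PCI_POSSIBLE_ERROR(), and wait_for_device(), dead_device_read() and
healthy_device_read() are hypothetical helpers.  The poll count
replaces the real timeout and backoff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define VENDOR_ID_PCI_SIG 0x0001u      /* Vendor ID seen for an RRS completion */
#define ERROR_RESPONSE    0xffffffffu  /* ~0: the read returned no valid data */

/* Local stand-ins for pci_bus_rrs_vendor_id() and PCI_POSSIBLE_ERROR() */
static bool is_rrs(uint32_t id)  { return (id & 0xffffu) == VENDOR_ID_PCI_SIG; }
static bool is_error(uint32_t v) { return v == ERROR_RESPONSE; }

enum wait_policy { POLICY_V6_11, POLICY_V6_12, POLICY_DEBUG_PATCH };

static bool looks_ready(enum wait_policy p, uint32_t val)
{
	switch (p) {
	case POLICY_V6_11:       /* poll PCI_COMMAND until it is not ~0 */
		return !is_error(val);
	case POLICY_V6_12:       /* poll PCI_VENDOR_ID until it is not 0x0001 */
		return !is_rrs(val);
	case POLICY_DEBUG_PATCH: /* additionally reject ~0 */
		return !is_rrs(val) && !is_error(val);
	}
	return false;
}

/* Hypothetical config reads: a device stuck after FLR, and a healthy one. */
typedef uint32_t (*read_fn)(void);
static uint32_t dead_device_read(void)    { return ERROR_RESPONSE; }
static uint32_t healthy_device_read(void) { return 0x061614c3u; }

/* Returns 0 once the device looks ready, -ENOTTY after max_polls tries. */
static int wait_for_device(enum wait_policy p, read_fn read, int max_polls)
{
	assert(max_polls > 0);
	for (int i = 0; i < max_polls; i++)
		if (looks_ready(p, read()))
			return 0;
	return -ENOTTY;
}
```

With a device that always returns ~0, POLICY_V6_11 and
POLICY_DEBUG_PATCH both time out with -ENOTTY, so the caller can fall
back to another reset method.  POLICY_V6_12 returns 0 on the first
poll, because 0xffff is not 0x0001.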


commit 70197d3ec778 ("PCI: Avoid FLR for Mediatek MT7922 WiFi")
Author: Bjorn Helgaas <bhelgaas@...gle.com>
Date:   Fri Feb 7 14:50:42 2025 -0600

    PCI: Avoid FLR for Mediatek MT7922 WiFi
    


diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index b84ff7bade82..82b21e34c545 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5522,7 +5522,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, quirk_intel_qat_vf_cap);
  * AMD Matisse USB 3.0 Host Controller 0x149c
  * Intel 82579LM Gigabit Ethernet Controller 0x1502
  * Intel 82579V Gigabit Ethernet Controller 0x1503
- *
+ * Mediatek MT7922 802.11ax PCI Express Wireless Network Adapter
  */
 static void quirk_no_flr(struct pci_dev *dev)
 {
@@ -5534,6 +5534,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x7901, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MEDIATEK, 0x0616, quirk_no_flr);
 
 /* FLR may cause the SolidRun SNET DPU (rev 0x1) to hang */
 static void quirk_no_flr_snet(struct pci_dev *dev)
