Message-ID: <CAOFcj8SLxv7qX5_i5DJ0YScG0EVkWFO5Qj-eMfzo_xpW5ziwQg@mail.gmail.com>
Date: Thu, 15 Jan 2026 16:43:12 -0800
From: Zac Bowling <zbowling@...il.com>
To: Sean Wang <sean.wang@...nel.org>
Cc: deren.wu@...iatek.com, kvalo@...nel.org, linux-kernel@...r.kernel.org,
linux-mediatek@...ts.infradead.org, linux-wireless@...r.kernel.org,
lorenzo@...nel.org, nbd@....name, ryder.lee@...iatek.com,
sean.wang@...iatek.com
Subject: Re: [PATCH v3 00/17] wifi: mt76: mt7925/mt792x: comprehensive
stability fixes
Hi Sean,

Thanks for testing this and catching that WARN. Good catch. Yeah, that
was my bug. While adding handling for every error return that my static
analyzer flagged as unhandled, I introduced an early return that fires
before a required callback is called. It's already patched locally and
in my repo. I'll send it shortly, once my poor man's stress test
finishes running. I also found another bug this morning that I need to
send a fix for: device resets coming out of suspend, and a corrupted
list left over from the earlier initialization.

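The kind of mistake described above, a newly added error check that
returns before a required callback has run, can be sketched in plain
userspace C. This is only an illustration under stated assumptions:
fake_session, send_cmd, stop_done_cb, stop_session_buggy and
stop_session_fixed are hypothetical names, not mt76 or mac80211
functions, and the actual fix in the series may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-session state; not a real mac80211 struct. */
struct fake_session {
    bool stop_pending; /* set when teardown starts */
    int done_calls;    /* how many times the completion callback ran */
};

/* Hypothetical "required callback": the caller's state only advances
 * once this has run, even if the hardware command failed. */
static void stop_done_cb(struct fake_session *s)
{
    s->stop_pending = false;
    s->done_calls++;
}

/* Hypothetical firmware command; fail != 0 simulates an error return. */
static int send_cmd(int fail)
{
    return fail ? -5 : 0; /* -5 has the same value as -EIO on Linux */
}

/* Buggy variant: the error check returns before the callback runs,
 * so the session is left with stop_pending still set. */
int stop_session_buggy(struct fake_session *s, int fail)
{
    int err;

    assert(s != NULL);
    s->stop_pending = true;
    err = send_cmd(fail);
    if (err)
        return err; /* early return skips stop_done_cb() */
    stop_done_cb(s);
    return 0;
}

/* Fixed variant: keep the error, always run the callback, then report. */
int stop_session_fixed(struct fake_session *s, int fail)
{
    int err;

    assert(s != NULL);
    s->stop_pending = true;
    err = send_cmd(fail);
    stop_done_cb(s);
    return err;
}
```

In the buggy variant a failed command leaves stop_pending set and the
callback count at zero; the fixed variant still reports the error to
its caller but completes the teardown first.
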
Zac Bowling
On Thu, Jan 15, 2026 at 4:15 PM Sean Wang <sean.wang@...nel.org> wrote:
>
> Hi Zac,
>
> Thanks for sharing this series. Overall the patches look good to me,
> and I’m continuing to test to make sure there are no regressions on
> mt7925 and mt7921.
> Today, however, I hit a kernel WARN in the disconnect path (mac80211 BA
> session teardown) while testing v3 of the series:
>
> [ 3373.120224] Hardware name: HP HP EliteBook 830 G6/854A, BIOS R70 Ver. 01.22.00 10/14/2022
> [ 3373.120228] Workqueue: events_unbound cfg80211_wiphy_work [cfg80211]
> [ 3373.120367] RIP: 0010:__ieee80211_stop_tx_ba_session+0x295/0x350 [mac80211]
> [ 3373.120570] Code: 11 0f 83 a3 00 00 00 48 c7 80 90 03 00 00 00 00 00 00 48 8b 7d 98 e8 4a 26 f3 fa 4c 89 ee 4c 89 ef e8 6f 16 0b fa 31 c0 eb 93 <0f> 0b 31 c0 eb 8d b8 8e ff ff ff eb 86 48 8b 7d 98 e8 25 26 f3 fa
> [ 3373.120576] RSP: 0018:ffffd00902ed7ba0 EFLAGS: 00010206
> [ 3373.120583] RAX: 0000000000010003 RBX: 0000000000000003 RCX: 0000000000000000
> [ 3373.120587] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [ 3373.120591] RBP: ffffd00902ed7c10 R08: 0000000000000000 R09: 0000000000000000
> [ 3373.120596] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 3373.120599] R13: ffff8a8433717540 R14: ffff8a83e0b20960 R15: ffff8a834d42c000
> [ 3373.120604] FS: 0000000000000000(0000) GS:ffff8a8477b03000(0000) knlGS:0000000000000000
> [ 3373.120608] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3373.120626] CR2: 00007b9e0a8ba0d0 CR3: 000000009a440005 CR4: 00000000003726f0
> [ 3373.120631] Call Trace:
> [ 3373.120656] <TASK>
> [ 3373.120664] ieee80211_sta_tear_down_BA_sessions+0x53/0xe0 [mac80211]
> [ 3373.120836] __sta_info_destroy_part1+0x48/0x550 [mac80211]
> [ 3373.120994] __sta_info_flush+0x10e/0x230 [mac80211]
> [ 3373.121150] ieee80211_set_disassoc+0x6b3/0x900 [mac80211]
> [ 3373.121293] ? _printk+0x5f/0x90
> [ 3373.121330] __ieee80211_disconnect+0xd6/0x1a0 [mac80211]
> [ 3373.121446] ieee80211_beacon_connection_loss_work+0x6d/0xc0 [mac80211]
> [ 3373.121573] cfg80211_wiphy_work+0xb4/0x190 [cfg80211]
> [ 3373.121779] process_one_work+0x191/0x3e0
> [ 3373.121789] worker_thread+0x2e3/0x420
> [ 3373.121796] ? __pfx_worker_thread+0x10/0x10
> [ 3373.121802] kthread+0x10d/0x230
> [ 3373.121810] ? __pfx_kthread+0x10/0x10
> [ 3373.121818] ret_from_fork+0x205/0x230
> [ 3373.121826] ? __pfx_kthread+0x10/0x10
> [ 3373.121832] ret_from_fork_asm+0x1a/0x30
> [ 3373.121842] </TASK>
> [ 3373.121844] ---[ end trace 0000000000000000 ]---
> [ 3373.128750] ------------[ cut here ]------------
> [ 3373.128757] WARNING: CPU: 1 PID: 14854 at net/mac80211/agg-tx.c:398 __ieee80211_stop_tx_ba_session+0x295/0x350 [mac80211]
>
> I’m currently bisecting the series to identify which patch triggers it
> and will follow up once I have clearer results.
> Thanks again for the work and the DKMS setup.
>
> Sean
>
> On Sun, Jan 4, 2026 at 6:27 PM Zac Bowling <zbowling@...il.com> wrote:
> >
> > From: Zac Bowling <zac@...bowling.com>
> >
> > This patch series addresses kernel panics, system deadlocks, and various
> > stability issues in the MT7925 WiFi driver. The issues were discovered on
> > kernel 6.17 (Ubuntu 25.10) and fixes were developed and tested on 6.18.2.
> >
> > These patches are based on the wireless tree (nbd168/wireless.git) as
> > requested by Sean Wang.
> >
> > == Problem Description ==
> >
> > The MT7925 driver has several bugs that cause:
> > - Kernel NULL pointer dereferences during BSSID roaming
> > - System-wide deadlocks requiring hard reboot
> > - Firmware reload failures after suspend/resume
> > - Key removal errors during MLO roaming
> >
> > These issues manifest approximately every 5 minutes when the adapter
> > tries to switch to a better BSSID, particularly in enterprise environments
> > with multiple access points.
> >
> > == Root Causes ==
> >
> > 1. Missing mutex protection around ieee80211_iterate_active_interfaces()
> > when the callback invokes MCU functions (patches 2, 3, 16)
> >
> > 2. NULL pointer dereferences where mt792x_vif_to_bss_conf(),
> > mt792x_sta_to_link(), and similar functions return NULL during
> > MLO state transitions but results are not checked (patches 1, 4, 5,
> > 9, 10, 14, 17)
> >
> > 3. Ignored MCU return values hiding firmware errors (patches 6, 7, 8)
> >
> > 4. WARN_ON_ONCE used where NULL is expected during normal MLO AP
> > setup (patch 13)
> >
> > 5. Firmware semaphore not released after failed load attempts (patch 15)
> >
> > 6. Key removal returning error when link is already torn down (patch 12)
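
Root causes 2 and 3 in the list above share a simple shape: a lookup
that can return NULL during a link transition must be checked before
use, and the return value of a firmware command should be passed up
rather than dropped. A minimal userspace sketch follows, assuming
made-up names (fake_vif, fake_link, vif_to_link, fake_mcu_cmd,
update_link) and arbitrarily chosen error codes; it is not the driver's
code and the actual patches may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_LINKS 4

/* Hypothetical per-link state; not the driver's real structures. */
struct fake_link {
    int id;
};

struct fake_vif {
    struct fake_link *links[MAX_LINKS]; /* NULL while a link is torn down */
};

/* Lookup that may legitimately return NULL during a link transition. */
static struct fake_link *vif_to_link(struct fake_vif *vif, unsigned int id)
{
    if (id >= MAX_LINKS)
        return NULL;
    return vif->links[id];
}

/* Stand-in for a firmware command that can fail. */
static int fake_mcu_cmd(const struct fake_link *link, int fail)
{
    (void)link;
    return fail ? -EIO : 0;
}

/* Check the lookup result before use, and pass the command's return
 * value up to the caller instead of ignoring it. */
int update_link(struct fake_vif *vif, unsigned int id, int fail)
{
    struct fake_link *link;

    assert(vif != NULL);
    link = vif_to_link(vif, id);
    if (!link)
        return -EINVAL; /* error code chosen only for this sketch */

    return fake_mcu_cmd(link, fail);
}
```

A caller then sees a plain error code for a missing link or a failed
command, rather than a NULL dereference or a silently lost firmware
error.
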
> >
> > == Testing ==
> >
> > Stress tested by hammering the driver with a custom test script.
> >
> > Tested on:
> > - Framework Desktop (AMD Ryzen AI Max 300 Series) with MT7925 (RZ717)
> > - This whole patch series was tested on Kernel 6.18.2 and 6.17.12 (Ubuntu 25.10)
> > - Enterprise WiFi environment with multiple WiFi 7 APs with MLO enabled
> >
> > Before patches: System hangs/panics every 5-15 minutes during BSSID roaming
> > After patches: Stable for 24+ hours under continuous stress testing
> >
> > == Crash Traces Fixed ==
> >
> > Primary NULL pointer dereference:
> > BUG: kernel NULL pointer dereference, address: 0000000000000010
> > Workqueue: mt76 mt7925_mac_reset_work [mt7925_common]
> > RIP: 0010:mt76_connac_mcu_uni_add_dev+0x9c/0x780 [mt76_connac_lib]
> > Call Trace:
> > mt7925_vif_connect_iter+0xcb/0x240 [mt7925_common]
> > __iterate_interfaces+0x92/0x130 [mac80211]
> > ieee80211_iterate_interfaces+0x3d/0x60 [mac80211]
> > mt7925_mac_reset_work+0x105/0x190 [mt7925_common]
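
A fault address as small as 0x10 in a NULL pointer dereference usually
means the code read a member at offset 0x10 through a NULL structure
pointer. The sketch below shows the arithmetic with a hypothetical
layout; fake_conf and fault_address are made-up names, and which real
structure and member were involved in this oops is not identified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, chosen only so that one member sits at 0x10. */
struct fake_conf {
    uint64_t a;     /* offset 0x00 */
    uint64_t b;     /* offset 0x08 */
    uint32_t field; /* offset 0x10 */
};

/* Address the CPU would access for base->field. With base == 0 (NULL)
 * the result is just the member's offset. Nothing is dereferenced. */
uintptr_t fault_address(uintptr_t base)
{
    /* Guard against wrap-around for very large base values. */
    assert(base <= UINTPTR_MAX - offsetof(struct fake_conf, field));
    return base + offsetof(struct fake_conf, field);
}
```

Working backwards from the reported address to a member offset in this
way is often a quick first step in narrowing down which pointer was
NULL.
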
> >
> > Deadlock trace:
> > INFO: task kworker/u128:0:48737 blocked for more than 122 seconds.
> > Workqueue: mt76 mt7925_mac_reset_work [mt7925_common]
> > Call Trace:
> > __mutex_lock.constprop.0+0x3d0/0x6d0
> > mt7925_mac_reset_work+0x85/0x170 [mt7925_common]
> >
> > == Related Links ==
> >
> > Framework Community discussion:
> > https://community.frame.work/t/kernel-panic-from-wifi-mediatek-mt7925-nullptr-dereference/79301
> >
> > OpenWrt GitHub issues:
> > https://github.com/openwrt/mt76/issues/1014
> > https://github.com/openwrt/mt76/issues/1036
> >
> > GitHub repository with additional analysis:
> > https://github.com/zbowling/mt7925
> >
> > Zac Bowling (17):
> > wifi: mt76: mt7925: fix NULL pointer dereference in vif iteration
> > wifi: mt76: mt7925: fix missing mutex protection in reset and ROC abort
> > wifi: mt76: mt7925: fix missing mutex protection in runtime PM and MLO PM
> > wifi: mt76: mt7925: add NULL checks in MCU STA TLV functions
> > wifi: mt76: mt7925: add NULL checks for link_conf and mlink in main.c
> > wifi: mt76: mt7925: add error handling for AMPDU MCU commands
> > wifi: mt76: mt7925: add error handling for BSS info MCU command in sta_add
> > wifi: mt76: mt7925: add error handling for BSS info in key setup
> > wifi: mt76: mt7925: add NULL checks in MLO link and chanctx functions
> > wifi: mt76: mt792x: fix NULL pointer dereference in TX path
> > wifi: mt76: mt7925: add lockdep assertions for mutex verification
> > wifi: mt76: mt7925: fix key removal failure during MLO roaming
> > wifi: mt76: mt7925: fix kernel warning in MLO ROC setup
> > wifi: mt76: mt7925: add NULL checks for MLO link pointers in MCU functions
> > wifi: mt76: mt792x: fix firmware reload failure after previous load crash
> > wifi: mt76: mt7925: add mutex protection in resume path
> > wifi: mt76: mt7925: add NULL checks in link station and TX queue setup
> >
> > drivers/net/wireless/mediatek/mt76/mt792x_core.c | 27 +++++++++++++++-
> > drivers/net/wireless/mediatek/mt76/mt7925/mac.c | 8 +++++
> > drivers/net/wireless/mediatek/mt76/mt7925/main.c | 95 +++++++++++++++++++++---
> > drivers/net/wireless/mediatek/mt76/mt7925/mcu.c | 52 ++++++++++++++---
> > drivers/net/wireless/mediatek/mt76/mt7925/pci.c | 6 +++
> > 5 files changed, 170 insertions(+), 18 deletions(-)
> >
> > --
> > 2.51.0
> >