Message-ID: <ZnSHcZttq79cJS3l@google.com>
Date: Thu, 20 Jun 2024 12:48:01 -0700
From: Brian Norris <briannorris@...omium.org>
To: Kalle Valo <kvalo@...nel.org>
Cc: Sascha Hauer <s.hauer@...gutronix.de>, linux-wireless@...r.kernel.org,
linux-kernel@...r.kernel.org, David Lin <yu-hao.lin@....com>,
Francesco Dolcini <francesco@...cini.it>
Subject: Re: [PATCH] [RFC] mwifiex: Fix NULL pointer deref

Hi Sascha,

On Wed, Jun 19, 2024 at 11:05:28AM +0300, Kalle Valo wrote:
> Sascha Hauer <s.hauer@...gutronix.de> writes:
>
> > When an Access Point is repeatedly started it happens that the
> > interrupts handler is called with priv->wdev.wiphy being NULL, but
> > dereferenced in mwifiex_parse_single_response_buf() resulting in:
> >
> > | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140
...
> > | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> > | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex]
> > | sp : ffff8000818b3a70
> > | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004
> > | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9
> > | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000
> > | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000
> > | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517
> > | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1
> > | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157
> > | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124
> > | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000
> > | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000
> > | Call trace:
> > | mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> > | mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex]
> > | mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex]
> > | mwifiex_process_sta_event+0x298/0xf0c [mwifiex]
> > | mwifiex_process_event+0x110/0x238 [mwifiex]
> > | mwifiex_main_process+0x428/0xa44 [mwifiex]
> > | mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio]
> > | process_sdio_pending_irqs+0x64/0x1b8
> > | sdio_irq_work+0x4c/0x7c
> > | process_one_work+0x148/0x2a0
> > | worker_thread+0x2fc/0x40c
> > | kthread+0x110/0x114
> > | ret_from_fork+0x10/0x20
> > | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000)
> > | ---[ end trace 0000000000000000 ]---
> >
> > Fix this by adding a NULL check before dereferencing this pointer.
> >
> > Signed-off-by: Sascha Hauer <s.hauer@...gutronix.de>
> >
> > ---
> >
> > This is the most obvious fix for this problem, but I am not sure if we
> > might want to catch priv->wdev.wiphy being NULL earlier in the call
> > chain.
>
> I haven't looked at the call but the symptoms sound like that either we
> are enabling the interrupts too early or there's some kind of locking
> problem so that an other cpu doesn't see the change.

I agree with Kalle that there's a different underlying bug involved, and
(my conclusion:) we shouldn't whack-a-mole the NULL pointer without
addressing the underlying problem.

Looking a bit closer (and without much other context to go on): I believe
that one potential underlying problem is the complete lack of locking
between cfg80211 entry points (such as mwifiex_add_virtual_intf() or
mwifiex_cfg80211_change_virtual_intf()) and most stuff in the main loop
(mwifiex_main_process()). The former call sites only hold the wiphy
lock, and the latter mostly holds no locks at all; it relies on being
serialized with itself, and uses its |main_proc_lock| only for setup
and teardown. It's all really bad and ready to fall down like a house of
cards at any moment. Unfortunately, no one has spent time on
rearchitecting this driver.

So it's possible that mwifiex_process_event() (via
mwifiex_get_priv_by_id() / mwifiex_get_priv()) is getting hold of a
not-fully-initialized 'priv' structure.
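
To make the suspected race concrete, here is a minimal userspace sketch
(pthreads, not kernel code) of a setup path and an event path that are
serialized by a common lock. Every name in it (fake_priv, fake_wiphy,
fake_priv_setup(), fake_handle_event(), ...) is hypothetical and chosen
for illustration only; none of it is taken from mwifiex, and it is not a
proposed fix. It only shows the general shape: the event path can never
observe a half-initialized object because it checks state under the same
lock that the setup and teardown paths hold while changing it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT mwifiex structures. */
struct fake_wiphy {
	int n_channels;
};

struct fake_priv {
	pthread_mutex_t lock;     /* protects 'wiphy' and 'ready' */
	struct fake_wiphy *wiphy; /* NULL until setup has completed */
	int ready;
};

#define FAKE_PRIV_INIT { PTHREAD_MUTEX_INITIALIZER, NULL, 0 }

/* Setup path (think: a configuration entry point). */
void fake_priv_setup(struct fake_priv *p, struct fake_wiphy *w)
{
	pthread_mutex_lock(&p->lock);
	p->wiphy = w;
	p->ready = 1;
	pthread_mutex_unlock(&p->lock);
}

/* Teardown path: unpublish under the same lock. */
void fake_priv_teardown(struct fake_priv *p)
{
	pthread_mutex_lock(&p->lock);
	p->ready = 0;
	p->wiphy = NULL;
	pthread_mutex_unlock(&p->lock);
}

/*
 * Event path (think: interrupt-driven main loop).
 * Returns 0 and fills *out on success, -1 if the object is not set up.
 */
int fake_handle_event(struct fake_priv *p, int *out)
{
	int ret = -1;

	assert(p != NULL && out != NULL);

	pthread_mutex_lock(&p->lock);
	if (p->ready && p->wiphy) {
		*out = p->wiphy->n_channels;
		ret = 0;
	}
	pthread_mutex_unlock(&p->lock);

	return ret;
}
```

In this sketch an event that arrives before fake_priv_setup() or after
fake_priv_teardown() is rejected with -1 rather than dereferencing a NULL
pointer. Whether events should be dropped, deferred, or prevented from
arriving at all in that window is a separate design question that the
sketch does not answer.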

BTW, in case I can reproduce and poke at your scenario, what exactly
is your test case? Are you just starting / killing / restarting hostapd
in a loop? Are you running a full network manager stack that's doing
something more complex (e.g., initiating scans)? Can you reproduce with
some more targeted set of `iw` commands? (`iw phy ... interface add ...;
iw dev ... del`) Is there anything else interesting in the dmesg logs?
(Some of the worst behaviors in this driver come when we see command
timeouts and mwifiex_reinit_sw(), for example.)

Or barring that, can you get some kind of trace of the nl80211 command
sequence, so it's clearer which command(s) are involved leading up to
the problem?

Brian