Message-ID: <ba97a2559cda1b14e0c9754523ff1152bdad90ef.camel@web.de>
Date: Thu, 15 May 2025 11:10:31 +0200
From: Bert Karwatzki <spasswolf@....de>
To: Johannes Berg <johannes@...solutions.net>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Cc: "linux-next@...r.kernel.org" <linux-next@...r.kernel.org>,
"llvm@...ts.linux.dev"
<llvm@...ts.linux.dev>, Thomas Gleixner <tglx@...utronix.de>,
linux-wireless@...r.kernel.org, spasswolf@....de
Subject: Re: lockup and kernel panic in linux-next-202505{09,12} when
compiled with clang
On Thursday, 15.05.2025 at 08:30 +0200, Johannes Berg wrote:
> On Thu, 2025-05-15 at 00:27 +0200, Bert Karwatzki wrote:
> > On Wednesday, 14.05.2025 at 20:56 +0200, Johannes Berg wrote:
> > > >
> > > > I've split off the problematic piece of code into a noinline function to simplify the disassembly:
> > > >
> > >
> > > Oh and also, does it even still crash with that? :)
> >
> > Yes, it still crashes when compiled with clang.
>
> OK, just checking. :)
To be more precise, I need clang AND PREEMPT_RT=y to get a crash.
>
> FWIW, I'm not convinced at all that the code you were looking at is
> really the problem. The crash (see below) is happening on the status
> side. Of course it cannot crash on the status side if on the TX side we
> never enter anything into the IDR data structure, and never tag the SKB
> to look up in the IDR and therefore never try to create the status
> report on the status side.
After looking at the backtrace I'm also no longer convinced that this piece of
code is the problem.
>
> Basically what happens is this:
>
> - on TX, if we have a socket requesting status, create a copy of the
> SKB, put it into the IDR, and put the IDR index into the original
> skb->cb
> - then transmit the original skb, of course
> - on TX status report from the driver, see if the skb->cb is tagged with
> the IDR value, if so, report the copy of the SKB back to the socket
> with the status information
>
> (The reason we need to make a copy is that the SKB could be encrypted or
> otherwise modified in flight, and we don't want to undo that, rather
> keeping a copy for the report.)
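The three steps above can be sketched in plain userspace C. This is only a toy
model and not the mac80211 code: struct pkt stands in for the SKB, the ack_id
field for the tag in skb->cb, and the fixed-size ack_table for the IDR. All of
these names, and tx_prepare()/tx_status(), are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ACK_SLOTS 16                 /* toy replacement for the IDR */

struct pkt {                         /* toy replacement for struct sk_buff */
	unsigned char data[64];
	size_t len;
	int ack_id;                  /* plays the role of the skb->cb tag, 0 = untagged */
};

static struct pkt *ack_table[ACK_SLOTS];   /* slot 0 is never used */

/*
 * TX side: if status was requested, store a pristine copy and tag the
 * original with the slot index. Returns the tag, or 0 if the packet
 * stays untagged (no status requested, table full, or out of memory).
 */
static int tx_prepare(struct pkt *orig, int wants_status)
{
	orig->ack_id = 0;
	if (!wants_status)
		return 0;

	for (int id = 1; id < ACK_SLOTS; id++) {
		struct pkt *copy;

		if (ack_table[id])
			continue;
		copy = malloc(sizeof(*copy));
		if (!copy)
			return 0;
		*copy = *orig;           /* copy taken before any in-flight changes */
		ack_table[id] = copy;
		orig->ack_id = id;
		return id;
	}
	return 0;
}

/*
 * Status side: if the transmitted packet is tagged, remove the stored
 * copy from the table and hand it back (the caller frees it).
 * Returns NULL for untagged packets or already-consumed slots.
 */
static struct pkt *tx_status(const struct pkt *orig)
{
	struct pkt *copy;
	int id = orig->ack_id;

	if (id <= 0 || id >= ACK_SLOTS)
		return NULL;
	copy = ack_table[id];
	ack_table[id] = NULL;
	return copy;
}
```

In this model the status side can only ever touch a copy if the TX side stored
one and tagged the original, which is the dependency described above.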
>
> > [ 267.339591][ T575] BUG: unable to handle page fault for address: ffffffff51e080b0
> > [ 267.339598][ T575] #PF: supervisor write access in kernel mode
> > [ 267.339602][ T575] #PF: error_code(0x0002) - not-present page
> > [ 267.339606][ T575] PGD f1cc3c067 P4D f1cc3c067 PUD 0
> > [ 267.339613][ T575] Oops: Oops: 0002 [#1] SMP NOPTI
> > [ 267.339622][ T575] CPU: 0 UID: 0 PID: 575 Comm: napi/phy0-0 Not tainted 6.15.0-rc6-next-20250513-llvm-00009-gec34cd07a425 #968 PREEMPT_{RT,(full)}
> > [ 267.339629][ T575] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.10F 11/11/2024
> > [ 267.339632][ T575] RIP: 0010:queued_spin_lock_slowpath+0x120/0x1c0
> ...
> > [ 267.339692][ T575] Call Trace:
> > [ 267.339701][ T575] <TASK>
> > [ 267.339705][ T575] _raw_spin_lock_irqsave+0x57/0x60
> > [ 267.339714][ T575] rt_spin_lock+0x73/0xa0
> > [ 267.339720][ T575] sock_queue_err_skb+0xdc/0x140
> > [ 267.339727][ T575] skb_complete_wifi_ack+0xa9/0x120
> > [ 267.339737][ T575] ieee80211_report_used_skb+0x541/0x6e0 [mac80211]
> > [ 267.339799][ T575] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 267.339804][ T575] ? start_dl_timer+0xcf/0x110
> > [ 267.339814][ T575] ieee80211_tx_status_ext+0x3b3/0x870 [mac80211]
> > [ 267.339851][ T575] ? raw_spin_rq_lock_nested+0x15/0x80
> > [ 267.339862][ T575] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 267.339866][ T575] ? rt_spin_lock+0x3d/0xa0
> > [ 267.339873][ T575] ? mt76_tx_status_unlock+0x38/0x230 [mt76]
> > [ 267.339886][ T575] mt76_tx_status_unlock+0x1e0/0x230 [mt76]
>
> Yeah so that's the crash on the status report as explained above, it
> kind of looks almost like the skb->sk was freed and somehow invalid now?
> But I don't see a general issue here (will keep digging), and how come
> it only shows up with clang?
>
> Since it reproduces pretty reliably, maybe you could do with KASAN?
>
I'm currently doing a test run with KASAN enabled. It has been running for ~1h
so far (without KASAN the longest time to a crash was about 10min), so KASAN is
probably hiding the bug (there are no messages from KASAN in dmesg).
> Also could be interesting - what userspace are you running with wifi?
> What tool is even setting up the wifi status? If you don't really know
> maybe just put WARN_ON(1) into net/core/sock.c where SO_WIFI_STATUS is
> written (sk_setsockopt).
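For reference, a userspace program would request wifi TX status on a socket
roughly like this. It is a minimal sketch: enable_wifi_status() is a
hypothetical helper name, and the fallback value 41 is an assumption that
matches the asm-generic headers (some architectures use a different number).

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_WIFI_STATUS
#define SO_WIFI_STATUS 41   /* assumption: asm-generic value, differs on some arches */
#endif

/*
 * Ask the kernel to report wifi TX status for packets sent on fd.
 * Returns 0 on success or a negative errno value on failure.
 */
static int enable_wifi_status(int fd)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_WIFI_STATUS, &on, sizeof(on)) < 0)
		return -errno;
	return 0;
}
```

Any process making a call like this would end up in the SO_WIFI_STATUS case of
sk_setsockopt, so a WARN_ON(1) placed there should identify the caller.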
>
> johannes
To record these backtraces I disabled wifi just after booting (it usually takes
~5s to connect here) with NetworkManager (nmcli, from Debian sid, last updated
on 20250511, before I encountered this bug):
$ nmcli radio wifi off
Then I set up netconsole, re-enabled wifi, and waited for the crash:
$ nmcli radio wifi on
Bert Karwatzki