linux-kernel - Re: lockup and kernel panic in linux-next-202505{09,12} when compiled with clang


Message-ID: <ba97a2559cda1b14e0c9754523ff1152bdad90ef.camel@web.de>
Date: Thu, 15 May 2025 11:10:31 +0200
From: Bert Karwatzki <spasswolf@....de>
To: Johannes Berg <johannes@...solutions.net>,
 "linux-kernel@...r.kernel.org"	 <linux-kernel@...r.kernel.org>
Cc: "linux-next@...r.kernel.org" <linux-next@...r.kernel.org>, 
 "llvm@...ts.linux.dev"
	 <llvm@...ts.linux.dev>, Thomas Gleixner <tglx@...utronix.de>, 
	linux-wireless@...r.kernel.org, spasswolf@....de
Subject: Re: lockup and kernel panic in linux-next-202505{09,12} when
 compiled with clang

On Thursday, 15.05.2025 at 08:30 +0200, Johannes Berg wrote:
> On Thu, 2025-05-15 at 00:27 +0200, Bert Karwatzki wrote:
> > On Wednesday, 14.05.2025 at 20:56 +0200, Johannes Berg wrote:
> > > > 
> > > > I've split off the problematic piece of code into a noinline function to simplify the disassembly:
> > > > 
> > > 
> > > Oh and also, does it even still crash with that? :)
> > 
> > Yes, it still crashes when compiled with clang.
> 
> OK, just checking. :)

To be more precise, I need both clang and PREEMPT_RT=y to get a crash.

> 
> FWIW, I'm not convinced at all that the code you were looking at is
> really the problem. The crash (see below) is happening on the status
> side. Of course it cannot crash on the status side if on the TX side we
> never enter anything into the IDR data structure, and never tag the SKB
> to look up in the IDR and therefore never try to create the status
> report on the status side.

After looking at the backtrace, I'm also no longer convinced that this piece of
code is the problem.

> 
> Basically what happens is this:
> 
> - on TX, if we have a socket requesting status, create a copy of the
>   SKB, put it into the IDR, and put the IDR index into the original
>   skb->cb
> - then transmit the original skb, of course
> - on TX status report from the driver, see if the skb->cb is tagged with
>   the IDR value, if so, report the copy of the SKB back to the socket
>   with the status information
> 
> (The reason we need to make a copy is that the SKB could be encrypted or
> otherwise modified in flight, and we don't want to undo that, rather
> keeping a copy for the report.)
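
The flow described above can be sketched as a small user-space toy model. This
is not the mac80211 code: struct pkt, tx_prepare(), tx_status() and the
fixed-size pending[] table are hypothetical stand-ins for the SKB, the TX path,
the status path and the IDR. It only illustrates why the status side has
nothing to look up unless the TX side stored a copy and tagged the original.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_PENDING 16  /* hypothetical table size; the kernel uses an IDR */

struct pkt {
	char payload[32];
	int ack_id;     /* 0 = untagged; stands in for the index kept in skb->cb */
};

/* Slot 0 stays unused so that an id of 0 can mean "no status requested". */
static struct pkt *pending[MAX_PENDING];

/*
 * TX side: if status was requested, store a copy and tag the original.
 * Returns the id used, 0 if the packet goes out untagged, -1 on failure.
 */
static int tx_prepare(struct pkt *orig, int status_requested)
{
	orig->ack_id = 0;
	if (!status_requested)
		return 0;

	for (int id = 1; id < MAX_PENDING; id++) {
		if (pending[id])
			continue;
		struct pkt *copy = malloc(sizeof(*copy));
		if (!copy)
			return -1;
		*copy = *orig;          /* snapshot before any in-flight changes */
		pending[id] = copy;
		orig->ack_id = id;
		return id;
	}
	return -1;                      /* table full, packet stays untagged */
}

/*
 * Status side: if the original is tagged, remove and return the stored copy.
 * The caller reports the copy to the socket and frees it.
 */
static struct pkt *tx_status(const struct pkt *orig)
{
	int id = orig->ack_id;

	if (id <= 0 || id >= MAX_PENDING)
		return NULL;

	struct pkt *copy = pending[id];
	pending[id] = NULL;
	return copy;
}
```

In this model, modifying the original after tx_prepare() (as encryption would)
leaves the stored copy untouched, and a second tx_status() call for the same
packet returns NULL because the entry has already been removed.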
> 
> >  [  267.339591][  T575] BUG: unable to handle page fault for address: ffffffff51e080b0
> >  [  267.339598][  T575] #PF: supervisor write access in kernel mode
> >  [  267.339602][  T575] #PF: error_code(0x0002) - not-present page
> >  [  267.339606][  T575] PGD f1cc3c067 P4D f1cc3c067 PUD 0 
> >  [  267.339613][  T575] Oops: Oops: 0002 [#1] SMP NOPTI
> >  [  267.339622][  T575] CPU: 0 UID: 0 PID: 575 Comm: napi/phy0-0 Not tainted 6.15.0-rc6-next-20250513-llvm-00009-gec34cd07a425 #968 PREEMPT_{RT,(full)} 
> >  [  267.339629][  T575] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.10F 11/11/2024
> >  [  267.339632][  T575] RIP: 0010:queued_spin_lock_slowpath+0x120/0x1c0
> ...
> > [  267.339692][  T575] Call Trace:
> >  [  267.339701][  T575]  <TASK>
> >  [  267.339705][  T575]  _raw_spin_lock_irqsave+0x57/0x60
> >  [  267.339714][  T575]  rt_spin_lock+0x73/0xa0
> >  [  267.339720][  T575]  sock_queue_err_skb+0xdc/0x140
> >  [  267.339727][  T575]  skb_complete_wifi_ack+0xa9/0x120
> >  [  267.339737][  T575]  ieee80211_report_used_skb+0x541/0x6e0 [mac80211]
> >  [  267.339799][  T575]  ? srso_alias_return_thunk+0x5/0xfbef5
> >  [  267.339804][  T575]  ? start_dl_timer+0xcf/0x110
> >  [  267.339814][  T575]  ieee80211_tx_status_ext+0x3b3/0x870 [mac80211]
> >  [  267.339851][  T575]  ? raw_spin_rq_lock_nested+0x15/0x80
> >  [  267.339862][  T575]  ? srso_alias_return_thunk+0x5/0xfbef5
> >  [  267.339866][  T575]  ? rt_spin_lock+0x3d/0xa0
> >  [  267.339873][  T575]  ? mt76_tx_status_unlock+0x38/0x230 [mt76]
> >  [  267.339886][  T575]  mt76_tx_status_unlock+0x1e0/0x230 [mt76]
> 
> Yeah so that's the crash on the status report as explained above, it
> kind of looks almost like the skb->sk was freed and somehow invalid now?
> But I don't see a general issue here (will keep digging), and how come
> it only shows up with clang?
> 
> Since it reproduces pretty reliably, maybe you could do with KASAN?
> 

I'm currently doing a test run with KASAN enabled. It has been running for ~1h
so far (without KASAN the maximum time to a crash was about 10min), so KASAN is
probably masking the bug (there are no messages from KASAN in dmesg).

> Also could be interesting - what userspace are you running with wifi?
> What tool is even setting up the wifi status? If you don't really know
> maybe just put WARN_ON(1) into net/core/sock.c where SO_WIFI_STATUS is
> written (sk_setsockopt).
>
> johannes
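
For context, this is roughly what a userspace program has to do to request
wifi TX status on a socket, i.e. the call that would reach the SO_WIFI_STATUS
case in sk_setsockopt. It is a minimal sketch: request_wifi_status() is a
hypothetical helper name, and the fallback value 41 is the asm-generic
definition, which is an assumption that does not hold on every architecture.

```c
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_WIFI_STATUS
#define SO_WIFI_STATUS 41  /* asm-generic value (assumption); some arches differ */
#endif

/*
 * Ask the kernel to report wifi TX status for frames sent on this socket.
 * Returns 0 on success or a negative errno value on failure.
 */
static int request_wifi_status(int fd)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_WIFI_STATUS, &on, sizeof(on)) < 0)
		return -errno;
	return 0;
}
```

A program that sets this option would then read the status reports from the
socket's error queue, which matches the sock_queue_err_skb() call in the
backtrace above.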

For recording these backtraces I disabled wifi just after booting (it usually
takes ~5s to connect here) using NetworkManager (nmcli) from Debian sid (last
updated on 20250511, before I encountered this bug):
$ nmcli radio wifi off
Then I set up netconsole, re-enabled wifi and waited for the crash:
$ nmcli radio wifi on

Bert Karwatzki