netdev - Re: netconsole: HARDIRQ-safe -> HARDIRQ-unsafe lock order warning

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <tgp5ddd2xdcvmkrhsyf2r6iav5a6ksvxk66xdw6ghur5g5ggee@cuz2o53younx>
Date: Tue, 26 Aug 2025 05:43:09 -0700
From: Breno Leitao <leitao@...ian.org>
To: Mike Galbraith <efault@....de>, Simon Horman <horms@...nel.org>, 
	kuba@...nel.org, calvin@...nvd.org
Cc: Pavel Begunkov <asml.silence@...il.com>, 
	Jakub Kicinski <kuba@...nel.org>, Johannes Berg <johannes@...solutions.net>, paulmck@...nel.org, 
	LKML <linux-kernel@...r.kernel.org>, netdev@...r.kernel.org, boqun.feng@...il.com
Subject: Re: netconsole: HARDIRQ-safe -> HARDIRQ-unsafe lock order warning

On Fri, Aug 22, 2025 at 05:54:28AM +0200, Mike Galbraith wrote:
> On Thu, 2025-08-21 at 10:35 -0700, Breno Leitao wrote:
> > > On Thu, Aug 21, 2025 at 05:51:59AM +0200, Mike Galbraith wrote:
>  
> > > > > --- a/drivers/net/netconsole.c
> > > > > +++ b/drivers/net/netconsole.c
> > > > > @@ -1952,12 +1952,12 @@ static void netcon_write_thread(struct c
> > > > >  static void netconsole_device_lock(struct console *con, unsigned long *flags)
> > > > >  {
> > > > >  	/* protects all the targets at the same time */
> > > > > -	spin_lock_irqsave(&target_list_lock, *flags);
> > > > > +	spin_lock(&target_list_lock);
> > > 
> > > I personally think this target_list_lock can be moved to an RCU lock.
> > > 
> > > If that is doable, then we probably make netconsole_device_lock()
> > > to a simple `rcu_read_lock()`, which would solve this problem as well.
> 
> The bigger issue for the nbcon patch would seem to be the seemingly
> required .write_atomic leading to landing here with disabled IRQs.

In this case, instead of transmitting through netpoll directly in the
.write_atomic context, we could queue the messages for later delivery.

With the current implementation, this is not straightforward unless we
introduce an additional message copy at the start of .write_atomic.

This is where the interface between netpoll and netconsole becomes
problematic. Ideally, we would avoid carrying extra data into netconsole
and instead copy the message into an SKB and queue the SKB for
transmission.

The core issue is that netpoll and netconsole are tightly coupled, and
several pieces of functionality that live in netpoll really belong in
netconsole. A good example is the SKB pool: that’s a netconsole concept,
not a netpoll one. None of the other netpoll users send raw char *
messages. They all work directly with skbs, so, in order to achieve it,
we need to move the concept of skb pool into netconsole, and give
netconsole the management of the skb pool.

> WRT my patch, seeing a hard RT crash on wired box cleanly logged with
> your nbcon patch applied (plus my twiddle mentioned earlier) tells me
> my patch has lost its original reason to exist.  It's relevant to this
> thread only in that those once thought to be RT specific IRQ disable
> spots turned out to actually be RT agnostic wireless sore spots.

Thanks. As a follow-up, I would suggest the following steps:

1) Decouple the SKB pool from netpoll and move it into netconsole

  * This makes netconsole behave like any other netpoll user,
    interacting with netpoll by sending SKBs.
	* The SKB population logic would then reside in netconsole, where it
	  logically belongs.

  * Enable NBCONS in netconsole, guarded by NETCONSOLE_NBCON
	* In normal .write_atomic() mode, messages should be queued in
	  a workqueue.
	* If oops_in_progress is set, we bypass the queue and
	  transmit the SKB immediately. (Maybe disabling lockdep?!).

Any concern with this plan?
--breno