netdev - Re: netconsole: HARDIRQ-safe -> HARDIRQ-unsafe lock order warning

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <847by65wfj.fsf@jogness.linutronix.de>
Date: Wed, 10 Sep 2025 14:28:40 +0206
From: John Ogness <john.ogness@...utronix.de>
To: Breno Leitao <leitao@...ian.org>
Cc: Mike Galbraith <efault@....de>, Simon Horman <horms@...nel.org>,
 kuba@...nel.org, calvin@...nvd.org, Pavel Begunkov
 <asml.silence@...il.com>, Johannes Berg <johannes@...solutions.net>,
 paulmck@...nel.org, LKML <linux-kernel@...r.kernel.org>,
 netdev@...r.kernel.org, boqun.feng@...il.com, Petr Mladek
 <pmladek@...e.com>, Sergey Senozhatsky <senozhatsky@...omium.org>, Steven
 Rostedt <rostedt@...dmis.org>
Subject: Re: netconsole: HARDIRQ-safe -> HARDIRQ-unsafe lock order warning

(Added CC printk folks since we are now talking about the nbcon API.)

Hi Breno,

On 2025-09-09, Breno Leitao <leitao@...ian.org> wrote:
> To summarize the problem:
>
> 1) netpoll calls .ndo_start_xmit() with IRQ disabled, which causes the
> lockdep problem reported in this thread. (current code)
>
> 2) moving netconsole to use NBCON will help in the thread context, given
> that .write_thread() doesn't need to have IRQ disabled. (This requires
> rework of netconsole target_list_lock)

Aside from reworking the target_list_lock, be aware that ->device_lock()
must at least disable migration. The kerneldoc for the ->device_lock()
callback talks about this.

> 3) In the atomic context, there is no easy solution so far. The options
> are not good, but, I will list them here for the sake of getting things
> clear:
>
>   a) Defer the msg as proposed initially.
>     Pro: If the machine is not crashing, it should simply work (?!)
>     Cons: It cannot be expected that the workqueue is functional during panic, thus
>           the messages might be lost

To be clear, "might be lost" means the message was not printed on the
netconsole. The message still will have made it into the ringbuffer and
attempted output to any other consoles as well as being available to
crash tools.

My problem with implementing deferring is that is what ->write_thread()
is already doing.

I wonder if we should extend the nbcon interface so that it is possible
to specify that ->write_atomic() is not safe. Then it would only be used
as a last resort in panic context.

@pmladek: We could introduce a new console flag (NBCON_ATOMIC_UNSAFE) so
that the callback is only used by nbcon_atomic_flush_unsafe().

>   b) Send the message anyway (and hope for the best)
>     Cons: Netpoll will continue to call IRQ unsafe locks from IRQ safe
>           context (lockdep will continue to be unhappy)
>     Pro: This is how it works today already, so, it is not making the problem worse.
>          In fact, it is narrowing the problem to only .write_atomic().

Two concerns here:

1. ->write_atomic() is also used during normal operation

2. It is expected that ->write_atomic() callbacks are implemented
   safely. The other nbcon citizens are doing this. Having an nbcon
   driver with an unsafe ->write_atomic() puts all nbcon drivers at risk
   of not functioning during panic.

This could be combined with (a) so that ->write_atomic() implements its
own deferred queue of messages to print and only when
@legacy_allow_panic_sync is true, will it try to send immediately and
hope for the best. @legacy_allow_panic_sync is set after all nbcon
drivers have had a chance to flush their buffers safely and then the
kernel starts to allow less safe drivers to flush.

Although I would prefer the NBCON_ATOMIC_UNSAFE approach instead.

>   c) Not implementing .write_atomic
>     Cons: we lose the most important messages of the boot.
>
>   Any other option I am not seeing?

d) Not implementing ->write_atomic() and instead implement a kmsg_dumper
   for netconsole. This registers a callback that is called during
   panic.

   Con: The kmsg_dumper interface has nothing to do with consoles, so it
        would require some effort coordinating with the console drivers.

   Pro: There is absolute freedom for the dumper to implement its own
        panic-only solution to get messages out.

e) Involve support from the underlying network drivers to implement true
   atomic sending. Thomas Gleixner talked [0] very briefly about how
   this could be implemented for netconsole during the 2022
   proof-of-concept presentation of the nbcon API.

   Cons: It most likely requires new API callbacks for the network
         drivers to implement hardware-specific solutions. Many (most?)
         drivers would not be able to support it.

   Pro: True reliable atomic printing via network.

John Ogness

[0] https://www.youtube.com/watch?v=TVhNcKQvzxI&t=3020s (begin 50:20)