Message-ID: <aJ-GMfXarWgEoYTH@mozart.vkv.me>
Date: Fri, 15 Aug 2025 12:10:41 -0700
From: Calvin Owens <calvin@...nvd.org>
To: Breno Leitao <leitao@...ian.org>
Cc: Jakub Kicinski <kuba@...nel.org>,
Pavel Begunkov <asml.silence@...il.com>,
Johannes Berg <johannes@...solutions.net>,
Mike Galbraith <efault@....de>, paulmck@...nel.org,
LKML <linux-kernel@...r.kernel.org>, netdev@...r.kernel.org,
boqun.feng@...il.com
Subject: Re: netconsole: HARDIRQ-safe -> HARDIRQ-unsafe lock order warning
On Friday 08/15 at 10:29 -0700, Breno Leitao wrote:
> On Fri, Aug 15, 2025 at 09:42:17AM -0700, Jakub Kicinski wrote:
> > On Fri, 15 Aug 2025 11:44:45 +0100 Pavel Begunkov wrote:
> > > On 8/15/25 01:23, Jakub Kicinski wrote:
> >
> > I suspect disabling netconsole over WiFi may be the most sensible way out.
>
> I believe we might be facing a similar issue with virtio-net.
> Specifically, any network adapter where TX is not safe to use in IRQ
> context encounters this problem.
>
> If we want to keep netconsole enabled on all TX paths, a possible
> solution is to defer the transmission work when netconsole is called
> inside an IRQ.
>
> The idea is that netconsole first checks if it is running in an IRQ
> context using in_irq(). If so, it queues the skb without transmitting it
> immediately and schedules deferred work to handle the transmission
> later.
>
> A rough implementation could be:
>
> static void send_udp(struct netconsole_target *nt, const char *msg, int len)
> {
> 	struct netpoll *np = &nt->np;
> 	/* Get the skb that is already populated with all the headers
> 	 * and ready to be sent.
> 	 */
> 	struct sk_buff *skb = netpoll_get_skb(np, msg, len);
>
> 	if (in_irq()) {
> 		/* TX is not safe here: queue the skb and defer transmission.
> 		 * delayed_queue and flush_delayed_queue would be new fields
> 		 * in struct netpoll.
> 		 */
> 		skb_queue_tail(&np->delayed_queue, skb);
> 		schedule_delayed_work(&np->flush_delayed_queue, 0);
> 		return;
> 	}
>
> 	__netpoll_send_skb(np, skb);
> }
>
> This approach does not require additional memory or extra data copying,
> since copying from the printk buffer to the skb must be performed
> regardless.
>
> The main drawback is a slight delay for messages sent from within an IRQ
> context, though I believe such cases are infrequent.
>
> We could potentially also perform the flush from softirq context, which
> would help reduce this latency further.

If we take an OOPS in any IRQ, I suspect that delayed_work will never
get a chance to run, and we'll now lose all such OOPSes over netconsole?
I don't think a softirq would get a chance to run either in that case?

Clearly, if it was the IRQ of the net driver behind netconsole, the OOPS
is unlikely to make it out anyway. But in my experience, OOPSes in IRQs
other than that driver's *do* get emitted pretty reliably.

If your condition instead becomes:

	if (in_irq() && !oops_in_progress)

...I think we can have our cake and eat it too? In an OOPS we're already
busting locks and such, so all bets are off anyway. Although I suppose
that might still drop messages emitted immediately before the OOPS...