linux-kernel - Re: NOHZ tick-stop error: local softirq work is pending, handler #08!!! on Dell XPS 13 9360


Message-ID: <Z6n-dWDSxNCjROYV@localhost.localdomain>
Date: Mon, 10 Feb 2025 14:26:13 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: Paul Menzel <pmenzel@...gen.mpg.de>
Cc: Michał Pecio <michal.pecio@...il.com>,
	anna-maria@...utronix.de, linux-kernel@...r.kernel.org,
	linux-trace-kernel@...r.kernel.org, linux-usb@...r.kernel.org,
	mingo@...nel.org, tglx@...utronix.de
Subject: Re: NOHZ tick-stop error: local softirq work is pending, handler
 #08!!! on Dell XPS 13 9360

On Mon, Feb 10, 2025 at 12:59:42PM +0100, Paul Menzel wrote:
> Dear Michał,
> 
> 
> Thank you for your reply.
> 
> On 10.02.25 at 12:45, Michał Pecio wrote:
> 
> > > > > > > > > On Dell XPS 13 9360/0596KF, BIOS 2.21.0 06/02/2022, with Linux
> > > > > > > > > 6.9-rc2+
> > 
> > > Just for the record, I am still seeing this with 6.14.0-rc1
> > 
> > Is this a regression? If so, which versions were not affected?
> 
> Unfortunately, I do not know. Right now, my logs only go back to September
> 2024.
> 
>     Sep 22 13:08:04 abreu kernel: Linux version 6.11.0-07273-g1e7530883cd2
> (build@...emianrhapsody.molgen.mpg.de) (gcc (Debian 14.2.0-5) 14.2.0, GNU ld
> (GNU Binutils for Debian) 2.43.1) #12 SMP PREEMPT_DYNAMIC Sun Sep 22
> 09:57:36 CEST 2024
> 
> > How hard to reproduce? Wasn't it during resume from hibernation?
> 
> It’s not easy to reproduce, and I believe it’s not related to resuming
> from hibernation (which I do not use) or ACPI S3 suspend. I think I can
> trigger it more often with the USB-C adapter attached and only the network
> cable plugged into it, and then running `sudo powertop --auto-tune`. But
> sometimes it seems unrelated.
> 
> > IRQ issues may be a red herring; this code here is a busy wait under a
> > spinlock. There are a few of those, and they cause various problems.
> > 
> >                  if (xhci_handshake(&xhci->op_regs->status,
> >                                STS_RESTORE, 0, 100 * 1000)) {
> >                          xhci_warn(xhci, "WARN: xHC restore state timeout\n");
> >                          spin_unlock_irq(&xhci->lock);
> >                          return -ETIMEDOUT;
> >                  }
> > 
> > This thing timing out may be close to the root cause of everything.
> 
> Interesting. Hopefully the USB folks have an idea.

Handler #08 is NET_RX (the value is the pending softirq mask in hex, and
0x08 is bit 3). So something raised NET_RX from an inappropriate place,
perhaps...
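
For reference, here is a minimal sketch of how the number in that warning
can be decoded. It assumes the value is a bitmask with one bit per softirq
vector, in the order of the softirq enum in include/linux/interrupt.h. The
helper name decode_softirq_mask() is hypothetical and only for illustration;
it is userspace code, not a kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Softirq names in vector order (bit 0 first). Assumed to match the
 * softirq enum in include/linux/interrupt.h. */
static const char *const softirq_names[] = {
	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
	"IRQ_POLL", "TASKLET", "SCHED", "HRTIMER", "RCU",
};

/* Hypothetical helper: write the names of all softirqs set in 'pending'
 * into 'buf', separated by single spaces. Output is cut short if 'buf'
 * is too small. */
void decode_softirq_mask(unsigned int pending, char *buf, size_t len)
{
	size_t count = sizeof(softirq_names) / sizeof(softirq_names[0]);
	size_t used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < count; i++) {
		int w;

		if (!(pending & (1u << i)))
			continue;

		w = snprintf(buf + used, len - used, "%s%s",
			     used ? " " : "", softirq_names[i]);
		if (w < 0 || (size_t)w >= len - used)
			break;	/* buffer full, stop here */
		used += (size_t)w;
	}
}
```

With a 64-byte buffer, decode_softirq_mask(0x08, buf, sizeof(buf)) leaves
"NET_RX" in buf, which matches the handler reported here.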

Can I ask you for one more trace dump?

I need:

echo 1 > /sys/kernel/tracing/events/irq/softirq_raise/enable
echo 1 > /sys/kernel/tracing/options/stacktrace

Unfortunately this will also involve a small patch:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fa058510af9c..accd2eb8c927 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1159,6 +1159,9 @@ static bool report_idle_softirq(void)
 	if (local_bh_blocked())
 		return false;
 
+	trace_printk("STOP\n");
+	trace_dump_stack(0);
+	tracing_off();
 	pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n",
 		pending);
 	ratelimit++;



Thanks.