lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20181227065321.GA3749@lerouge>
Date:   Thu, 27 Dec 2018 07:53:23 +0100
From:   Frederic Weisbecker <frederic@...nel.org>
To:     Heiner Kallweit <hkallweit1@...il.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Anna-Maria Gleixner <anna-maria@...utronix.de>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Grygorii Strashko <grygorii.strashko@...com>
Subject: Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may
 have revealed another problem

On Mon, Oct 15, 2018 at 10:58:54PM +0200, Heiner Kallweit wrote:
> On 28.09.2018 15:18, Frederic Weisbecker wrote:
> > On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
> >> On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
> >>> On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
> >>>> I tested it and Frederic is right, it doesn't help. Can it be somehow related to
> >>>> the cpu being brought down during suspend? Because I get the warning only during
> >>>> suspend when the cpu is inactive already (but still online).
> >>>
> >>> It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.
> >>>
> >>> Does this script eventually trigger it after some time?
> >>
> >> Any update to this?
> > 
> > Heiner? Can you please test the script I sent to you?
> > 
> > Thanks.
> > 
> Sorry, took some time .. And yes, running your script triggers the message too.
> 
> [   25.646015] x86: Booting SMP configuration:
> [   25.646044] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.664491] smpboot: CPU 1 is now offline
> [   25.679299] x86: Booting SMP configuration:
> [   25.679329] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.698449] smpboot: CPU 1 is now offline
> [   25.711698] x86: Booting SMP configuration:
> [   25.711727] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.729185] NOHZ: local_softirq_pending 202
> [   25.729229] NOHZ: local_softirq_pending 202
> [   25.730759] smpboot: CPU 1 is now offline
> [   25.744053] x86: Booting SMP configuration:
> [   25.744083] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.762520] smpboot: CPU 1 is now offline
> [   25.776834] x86: Booting SMP configuration:
> [   25.776863] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.794189] NOHZ: local_softirq_pending 202
> [   25.796672] smpboot: CPU 1 is now offline
> [   25.805970] x86: Booting SMP configuration:
> [   25.805999] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.827360] smpboot: CPU 1 is now offline
> [   25.839043] x86: Booting SMP configuration:
> [   25.839073] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.858184] NOHZ: local_softirq_pending 202
> [   25.862182] smpboot: CPU 1 is now offline
> [   25.873759] x86: Booting SMP configuration:
> [   25.873789] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.893385] smpboot: CPU 1 is now offline
> 

Sorry, I got sidetracked and almost forgot about it.

So this is triggered by CPU hotplug. At some point the CPU has an
opportunity to go idle and for some reason the timer softirq is still
pending. We need to know which timer this is about and why the timer
softirq keeps pending.

I'm going to need your help again. Can you please run the following (possibly
repeat until it triggers the bug) ?

   echo 1 > /sys/devices/system/cpu/cpu1/online

   # Pause and reset tracing
   echo 0 > /sys/kernel/debug/tracing/tracing_on
   echo > /sys/kernel/debug/tracing/trace

   # Turn on relevant events
   echo 1 > /sys/kernel/debug/tracing/events/timer/timer_*/enable
   echo 1 > /sys/kernel/debug/tracing/events/irq/softirq_*/enable
   echo 1 > /sys/kernel/debug/tracing/tracing_on

   # Trigger
   echo 0 > /sys/devices/system/cpu/cpu1/online

   echo 0 > /sys/kernel/debug/tracing/tracing_on

And please apply the following patch before. With all that I'll have the
relevant informations stored in /sys/kernel/debug/tracing/per_cpu/cpu1/trace
Please send its content to me. Thanks!

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 69e673b..0e57a3b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -892,6 +892,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 		    (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
 			pr_warn("NOHZ: local_softirq_pending %02x\n",
 				(unsigned int) local_softirq_pending());
+			trace_dump_stack(0);
 			ratelimit++;
 		}
 		return false;

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ