Date:   Sat, 27 Jun 2020 15:14:14 -0700
From:   Andy Lutomirski <luto@...capital.net>
To:     paulmck@...nel.org
Cc:     Andy Lutomirski <luto@...nel.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        kernel-team <Kernel-team@...com>
Subject: Re: [PATCH tick-sched] Clarify "NOHZ: local_softirq_pending" warning


> On Jun 27, 2020, at 2:46 PM, Paul E. McKenney <paulmck@...nel.org> wrote:
> 
> On Sat, Jun 27, 2020 at 02:02:15PM -0700, Andy Lutomirski wrote:
>>> On Fri, Jun 26, 2020 at 2:05 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>>> 
>>> Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
>>> (where "HH" is the hexadecimal softirq vector number) when one or more
>>> non-RCU softirq handlers are still enabled when checking to stop the
>>> scheduler-tick interrupt.  This message is not as enlightening as one
>>> might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
>>> local softirq work is pending, handler #HH".
>> 
>> Thank you!  It would be even better if it would explain *why* the
>> problem happened, but I suppose this code doesn't actually know.
> 
> Glad to help!
> 
> To your point, is it possible to bisect the appearance of this message,
> or is it as usual non-reproducible?  (Hey, had to ask!)
> 
>                            

In this particular case, I tracked it down by good old-fashioned sleuthing for bugs, but it’s still unclear to me precisely how NOHZ gets involved. The bug is that we were entering the kernel from usermode, doing nmi_enter(), turning on interrupts, maybe getting a page fault, raising a signal, turning off interrupts, doing nmi_exit(), and going back to usermode with the signal still queued and undelivered.  This is all kinds of bad, but I still don’t understand what softirqs or idle have to do with it.

But I have the bug fixed now!
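
For anyone trying to picture the sequence described above, here is a toy user-space model of it in C. It is only a sketch and not kernel code: every name (toy_cpu, toy_nmi_enter(), toy_exit_to_user_work() and so on) is hypothetical, and it assumes that an NMI-style return skips the pending-signal check that an ordinary return to usermode performs.

```c
/* Toy user-space model of the entry/exit sequence; NOT kernel code.
 * All names here are hypothetical and exist only for illustration. */
#include <assert.h>
#include <stdbool.h>

struct toy_cpu {
    bool irqs_enabled;      /* models the interrupt-enable state */
    bool in_nmi;            /* true between toy_nmi_enter() and toy_nmi_exit() */
    bool signal_pending;    /* a signal is queued but not yet delivered */
    int  signals_delivered; /* signals actually handed to the user task */
};

void toy_nmi_enter(struct toy_cpu *c)
{
    assert(!c->in_nmi);
    c->in_nmi = true;
}

void toy_nmi_exit(struct toy_cpu *c)
{
    assert(c->in_nmi);
    c->in_nmi = false;
}

/* Models the work assumed to happen on an ordinary return to usermode:
 * any queued signal is delivered before user code runs again. */
void toy_exit_to_user_work(struct toy_cpu *c)
{
    if (c->signal_pending) {
        c->signal_pending = false;
        c->signals_delivered++;
    }
}

/* The problematic path: NMI-style bracketing around code that enables
 * interrupts and may fault, followed by a return that does no exit work. */
void toy_buggy_entry(struct toy_cpu *c, bool page_fault)
{
    c->irqs_enabled = false;      /* entered the kernel from usermode */
    toy_nmi_enter(c);
    c->irqs_enabled = true;       /* turning on interrupts */
    if (page_fault)
        c->signal_pending = true; /* the fault raises a signal */
    c->irqs_enabled = false;      /* turning off interrupts */
    toy_nmi_exit(c);
    /* Back to usermode: nothing here looks at signal_pending. */
}

/* The same body on an ordinary entry path, for comparison. */
void toy_normal_entry(struct toy_cpu *c, bool page_fault)
{
    c->irqs_enabled = false;
    c->irqs_enabled = true;
    if (page_fault)
        c->signal_pending = true;
    c->irqs_enabled = false;
    toy_exit_to_user_work(c);     /* the queued signal is delivered here */
}
```

Calling toy_buggy_entry() with page_fault set leaves signal_pending true and signals_delivered at zero, while toy_normal_entry() delivers the signal before returning. The model says nothing about how softirqs or NOHZ come into it.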
