linux-kernel - Re: [PATCH tick-sched] Clarify "NOHZ: local_softirq


Message-Id: <83B12EF8-3792-4943-A548-5DB0C6FC71D1@amacapital.net>
Date:   Sat, 27 Jun 2020 15:14:14 -0700
From:   Andy Lutomirski <luto@...capital.net>
To:     paulmck@...nel.org
Cc:     Andy Lutomirski <luto@...nel.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        kernel-team <Kernel-team@...com>
Subject: Re: [PATCH tick-sched] Clarify "NOHZ: local_softirq_pending" warning


> On Jun 27, 2020, at 2:46 PM, Paul E. McKenney <paulmck@...nel.org> wrote:
> 
> On Sat, Jun 27, 2020 at 02:02:15PM -0700, Andy Lutomirski wrote:
>>> On Fri, Jun 26, 2020 at 2:05 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>>> 
>>> Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
>>> (where "HH" is the hexadecimal softirq vector number) when one or more
>>> non-RCU softirq handlers are still enabled when checking to stop the
>>> scheduler-tick interrupt.  This message is not as enlightening as one
>>> might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
>>> local softirq work is pending, handler #HH".
>> 
>> Thank you!  It would be even better if it would explain *why* the
>> problem happened, but I suppose this code doesn't actually know.
> 
> Glad to help!
> 
> To your point, is it possible to bisect the appearance of this message,
> or is it as usual non-reproducible?  (Hey, had to ask!)
> 
>                            

In this particular case, I tracked it down by good old-fashioned sleuthing, but it’s still unclear to me precisely how NOHZ gets involved. The bug is that we were entering the kernel from usermode, doing nmi_enter(), turning on interrupts, maybe getting a page fault, raising a signal, turning off interrupts, doing nmi_exit(), and going back to usermode with the signal still queued and undelivered.  This is all kinds of bad, but I still don’t understand what softirqs or idle have to do with it.
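
For illustration only, the sequence above can be modeled in user space roughly as follows. This is a sketch, not kernel code: struct task_model, buggy_entry_path() and normal_entry_path() are hypothetical names invented here, and the model assumes that the NMI-style exit does none of the usual exit-to-usermode work, which is why the signal stays queued.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct task_model {
    bool in_nmi_context;  /* between the modeled nmi_enter() and nmi_exit() */
    bool irqs_enabled;    /* modeled interrupt-enable state */
    bool signal_pending;  /* a signal has been queued for the task */
};

/*
 * Models the buggy path: entry from usermode treated as NMI-like.
 * Returns true if we get back to usermode with a signal still queued.
 */
bool buggy_entry_path(struct task_model *t, bool page_fault)
{
    t->in_nmi_context = true;      /* models nmi_enter() */
    t->irqs_enabled = true;        /* interrupts turned on */
    if (page_fault)
        t->signal_pending = true;  /* the fault raises a signal */
    t->irqs_enabled = false;       /* interrupts turned off */
    t->in_nmi_context = false;     /* models nmi_exit() */

    /* Assumption: the NMI-style exit skips pending-work handling. */
    assert(!t->irqs_enabled && !t->in_nmi_context);
    return t->signal_pending;
}

/*
 * Models an ordinary entry from usermode, where pending signals are
 * handled before returning. Returns true if a signal is left queued.
 */
bool normal_entry_path(struct task_model *t, bool page_fault)
{
    t->irqs_enabled = true;
    if (page_fault)
        t->signal_pending = true;
    t->irqs_enabled = false;

    if (t->signal_pending)
        t->signal_pending = false; /* signal delivered on the way out */
    return t->signal_pending;
}
```

In this model, buggy_entry_path() with a page fault returns true (signal left queued), while normal_entry_path() returns false. It says nothing about how softirqs or NOHZ come into the picture.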

But I have the bug fixed now!