Date:   Wed, 28 Jun 2017 13:14:04 -0700
From:   Andi Kleen <ak@...ux.intel.com>
To:     Don Zickus <dzickus@...hat.com>
Cc:     "Liang, Kan" <kan.liang@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "mingo@...nel.org" <mingo@...nel.org>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "babu.moger@...cle.com" <babu.moger@...cle.com>,
        "atomlin@...hat.com" <atomlin@...hat.com>,
        "prarit@...hat.com" <prarit@...hat.com>,
        "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "eranian@...gle.com" <eranian@...gle.com>,
        "acme@...hat.com" <acme@...hat.com>,
        "stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: Re: [PATCH V2] kernel/watchdog: fix spurious hard lockups

On Wed, Jun 28, 2017 at 03:00:08PM -0400, Don Zickus wrote:
> On Tue, Jun 27, 2017 at 04:48:22PM -0700, Andi Kleen wrote:
> > > I haven't heard back any test result yet.
> > > 
> > > The above patch looks good to me.
> > 
> > This needs performance testing.  It may slow down performance or latency sensitive workloads.
> 
> More motivation to work through the issues with the proposed real fix? :-)
> 
> > 
> > > Which workaround do you prefer, the above one or the one checking timestamp?
> > 
> > I prefer the earlier patch, it has far less risk of performance issues.
> 
> But now you are slowing down the nmi_watchdog so much that the
> watchdog_thresh hold becomes meaningless, no? (granted the turbo-mode blows
> it out of the water too)  So now folks who depend on the 10/5/1/whatever second
> reliability lose that.  I think that might be unfair too.

What do you mean by reliability? If you need a guaranteed reset,
you always need a separate hardware watchdog (like the TCO watchdog),
as the CPU could be hung badly enough that even the NMI watchdog is no
longer functional.

So relying solely on the NMI watchdog doesn't make any sense.

It can be a useful debugging tool for a specific class of bugs: 
when kernel software is looping forever.

But if that happens, does it really matter how many iterations the
loop does before it is stopped?

Even the current timeout is essentially eternity in CPU time, and 3x
eternity is still eternity.
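
To put rough numbers on the "eternity in CPU time" point, here is a
back-of-the-envelope sketch. The 2 GHz clock and the 10 second threshold
are illustrative assumptions, not figures measured in this thread, and
the helper name is made up for the example:

```python
# Hypothetical figures: a 2 GHz core clock and a 10 s watchdog threshold.
CPU_HZ = 2_000_000_000
WATCHDOG_THRESH_S = 10


def cycles_before_detection(thresh_s, cpu_hz=CPU_HZ, slowdown=1):
    """Core clock cycles that elapse before a lockup would be reported.

    `slowdown` models the watchdog period being stretched by that factor.
    """
    return thresh_s * slowdown * cpu_hz


baseline = cycles_before_detection(WATCHDOG_THRESH_S)
tripled = cycles_before_detection(WATCHDOG_THRESH_S, slowdown=3)
print(f"{baseline:.1e} cycles at 1x, {tripled:.1e} cycles at 3x")
```

Under these assumptions, both values are tens of billions of cycles, so
a looping kernel has already spun an enormous number of times in either
case.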

> The hrtimer increase maintains that and just adds a few more
> interrupts/second.

Interruptions are a big deal for many people.

-Andi
