Message-ID: <37D7C6CF3E00A74B8858931C1DB2F07753784A2B@SHSMSX103.ccr.corp.intel.com>
Date:   Tue, 15 Aug 2017 01:16:51 +0000
From:   "Liang, Kan" <kan.liang@...el.com>
To:     'Don Zickus' <dzickus@...hat.com>,
        'Thomas Gleixner' <tglx@...utronix.de>
CC:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "mingo@...nel.org" <mingo@...nel.org>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "babu.moger@...cle.com" <babu.moger@...cle.com>,
        "atomlin@...hat.com" <atomlin@...hat.com>,
        "prarit@...hat.com" <prarit@...hat.com>,
        "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "eranian@...gle.com" <eranian@...gle.com>,
        "acme@...hat.com" <acme@...hat.com>,
        "ak@...ux.intel.com" <ak@...ux.intel.com>,
        "stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: RE: [PATCH V2] kernel/watchdog: fix spurious hard lockups



> On Mon, Jul 17, 2017 at 01:24:23AM +0000, Liang, Kan wrote:
> > Hi Don & Thomas,
> >
> > Sorry for the late response. We just finished the tests for all proposed
> > patches.
> >
> > There are three proposed patches so far.
> > Patch 1: The patch above, which speeds up the hrtimer.
> > Patch 2: Thomas's first proposal.
> > https://patchwork.kernel.org/patch/9803033/
> > https://patchwork.kernel.org/patch/9805903/
> > Patch 3: My original proposal, which increases the NMI watchdog timeout
> > by 3X. https://patchwork.kernel.org/patch/9802053/
> >
> > According to our tests, only patch 3 works well.
> > The other two patches eventually hang the system.
> > With patch 1, the system hung after running our test case for ~1 hour.
> > With patch 2, the system hung during the overnight test.
> > No error message was shown when the system hung, so I don't know
> > the root cause yet.
> 
> Hi Kan,
> 
> Thanks for the feedback.  Odd that the different patches had different results.
> What is more odd to me is the hang.  I thought these were all false lockups
> that prematurely panic'd and rebooted the box.
> 
> Is the machine configured to panic on hardlockup and reboot?  Perhaps
> kdump is enabled to store the console log for review upon reboot?
> 
> It almost implies that a hardlockup did happen but isn't being detected until
> later??
> >
> > BTW: We set watchdog_thresh to 1 when we ran the test.
> > We believe that makes the failure show up sooner.
> 
> Sure, you/they look for 1 second hangs instead of 10 second ones.  But with
> patch 3 it is more like roughly 3 seconds vs. roughly 30 seconds.
> 
> As Thomas asked, I would also be interested in the way the test works.  The
> hang doesn't make sense.
> 
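
For reference, the threshold arithmetic discussed in the quoted text can be
sketched as below. This is only an illustration, not kernel code: the
function name and the multiplier parameter are hypothetical, and it assumes
the hard-lockup detection window is simply watchdog_thresh seconds, scaled
by 3 under patch 3, as described in the thread.

```python
def hardlockup_window_secs(watchdog_thresh: int, multiplier: int = 1) -> int:
    """Approximate hard-lockup detection window in seconds.

    Hypothetical helper for illustration only.
    multiplier=1 models the unpatched behaviour as described in the thread;
    multiplier=3 models patch 3 (NMI watchdog timeout increased by 3X).
    """
    if watchdog_thresh <= 0:
        # This sketch only models positive thresholds.
        raise ValueError("watchdog_thresh must be positive in this sketch")
    return watchdog_thresh * multiplier


if __name__ == "__main__":
    # 1 is the value used in the test runs, 10 is the other value mentioned.
    for thresh in (1, 10):
        unpatched = hardlockup_window_secs(thresh)
        patch3 = hardlockup_window_secs(thresh, multiplier=3)
        print(f"watchdog_thresh={thresh}: ~{unpatched}s unpatched, ~{patch3}s with patch 3")
```

With watchdog_thresh set to 1, the sketch gives about 1 second unpatched and
about 3 seconds with patch 3; with 10, it gives about 10 and 30 seconds.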

Hi Don and Thomas, 

Sorry for the late response.

We have confirmed that the hard lockup seen with the "speed up the hrtimer"
patch is actually a separate issue. Tim has already proposed a patch to fix it.
Here is his patch: https://lkml.org/lkml/2017/8/14/1000

The patch that speeds up the hrtimer (https://lkml.org/lkml/2017/6/26/685)
does fix the spurious hard lockups.
Tested-by: Kan Liang <kan.liang@...el.com>

Please consider merging it into both the mainline and stable trees.

Thanks,
Kan
