Message-ID: <57AB79F8.8080306@redhat.com>
Date:	Wed, 10 Aug 2016 15:01:12 -0400
From:	Prarit Bhargava <prarit@...hat.com>
To:	"Long, Wai Man" <waiman.long@....com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>
CC:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"x86@...nel.org" <x86@...nel.org>, Borislav Petkov <bp@...e.de>,
	Andy Lutomirski <luto@...nel.org>,
	"Norton, Scott J" <scott.norton@....com>,
	"Hatch, Douglas B (HPE Servers - Linux)" <doug.hatch@....com>,
	"Wright, Randy (HPE Servers Linux)" <rwright@....com>
Subject: Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention



On 08/10/2016 02:37 PM, Long, Wai Man wrote:
> Hi,
> 
> I would like to restart the discussion about the merit of this patch.
> 
> This patch was created in response to a problem we saw on 16-socket Broadwell-EX systems (up to 768 logical CPUs) that were under development. About 10% of kernel boots experienced soft lockups:
> 
> [   71.618132] NetLabel: Initializing
> [   71.621967] NetLabel:  domain hash size = 128
> [   71.626848] NetLabel:  protocols = UNLABELED CIPSOv4
> [   71.632418] NetLabel:  unlabeled traffic allowed by default
> [   71.638679] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
> [   71.646504] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
> [   71.655313] Switching to clocksource hpet
> [   95.679135] BUG: soft lockup - CPU#144 stuck for 23s! [swapper/144:0]
> [   95.693363] BUG: soft lockup - CPU#145 stuck for 23s! [swapper/145:0]
> [   95.694203] Modules linked in:
> [   95.694697] CPU: 145 PID: 0 Comm: swapper/145 Not tainted 3.10.0-327.el7.x86_64 #1
> [   95.695580] BUG: soft lockup - CPU#582 stuck for 23s! [swapper/582:0]
> [   95.696145] Hardware name: HP Superdome2 16s x86, BIOS Bundle: 008.001.006 SFW: 041.063.152 01/16/2016
> [   95.698128] BUG: soft lockup - CPU#357 stuck for 23s! [swapper/357:0]
> [   95.699774] task: ffff8cf7fecf4500 ti: ffff89787c748000 task.ti: ffff89787c748000
> 
> During boot, there is a short period in which the clocksource is switched to hpet to calibrate the TSCs. It is switched back to tsc once the calibration is done. The soft lockups may happen during that short period.
> 
> Prarit also hit this problem on a smaller Intel box with 96 cores (192 threads). Maybe he can supply more information about what he saw.
> 

I've hit this on a system with 192 threads.  The TSC is functional and has
passed the TSC sync checks during boot.  When the HPET is used to resynchronize
the TSC, I occasionally see:

PCI: Using ACPI for IRQ routing
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
hpet0: 8 comparators, 64-bit 24.000000 MHz counter
Switched to clocksource hpet

followed by the same NMI flood that Waiman described.  After some debugging I
came to the same conclusion as Waiman: the HPET becomes a point of contention
when many threads on the system access it rapidly.

After applying his patch the problem no longer occurs.
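
The general idea of reducing this kind of read contention can be sketched in
user-space C.  This is only an illustration under stated assumptions, not the
code from the patch: slow_hw_read() is a hypothetical stand-in for the
expensive HPET counter read, and read_counter_shared(), cache and LOCKED are
names invented for the example.  One thread takes a lock bit and performs the
slow read; threads that find the lock held wait for the published value to
change and reuse it instead of issuing their own read.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware counter; not a real HPET access. */
static _Atomic uint32_t fake_hw_counter;
static _Atomic unsigned long hw_reads;  /* number of "slow" reads performed */

static uint32_t slow_hw_read(void)
{
    atomic_fetch_add(&hw_reads, 1);
    return atomic_fetch_add(&fake_hw_counter, 1) + 1;
}

/*
 * The lock bit and the last value read share one 64-bit word, so a waiter
 * can sample both with a single atomic load.
 * Bit 32: lock flag.  Bits 0-31: last published counter value.
 */
#define LOCKED (1ULL << 32)
static _Atomic uint64_t cache;

uint32_t read_counter_shared(void)
{
    uint64_t old = atomic_load(&cache);

    if (!(old & LOCKED) &&
        atomic_compare_exchange_strong(&cache, &old, old | LOCKED)) {
        /* Lock owner: do the one slow read, then publish and unlock
         * with a single store (the lock bit is clear in the new word). */
        uint32_t now = slow_hw_read();
        atomic_store(&cache, (uint64_t)now);
        return now;
    }

    /* Contended: wait until the owner publishes a new value. */
    uint32_t seen = (uint32_t)old;
    for (;;) {
        uint64_t cur = atomic_load(&cache);
        if ((uint32_t)cur != seen)
            return (uint32_t)cur;   /* reuse the owner's fresh value */
        if (!(cur & LOCKED))
            break;                  /* unlocked, but value did not change */
    }

    /* Fallback: read the counter directly. */
    return slow_hw_read();
}
```

With no contention every call takes the lock and reads the counter itself, so
behavior is unchanged; the sharing only matters when many threads call
read_counter_shared() at once.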

P.
