linux-kernel - Re: [BUG] Core2 cpu triggers hard lockup with perf test

Message-ID: <20160301091703.GN6356@twins.programming.kicks-ass.net>
Date:	Tue, 1 Mar 2016 10:17:03 +0100
From:	Peter Zijlstra <peterz@...radead.org>
To:	"Liang, Kan" <kan.liang@...el.com>
Cc:	Jiri Olsa <jolsa@...hat.com>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Andi Kleen <andi@...stfloor.org>,
	Stephane Eranian <eranian@...gle.com>,
	Wang Nan <wangnan0@...wei.com>,
	"zheng.z.yan@...el.com" <zheng.z.yan@...el.com>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [BUG] Core2 cpu triggers hard lockup with perf test

On Mon, Feb 29, 2016 at 10:12:08PM +0000, Liang, Kan wrote:

> In SDM "18.4.4.4 Re-configuring PEBS Facilities" it is mentioned that
> a quiescent period is needed between stopping the prior event counting and
> setting up a new PEBS event when software needs to reconfigure PEBS facilities.
> The quiescent period is to allow any latent residual PEBS records to complete
> their capture at their previously specified buffer address.

> That requirement can only be found in Core Microarchitecture.

But that should apply to all (PEBS) event scheduling, not just the
multi thing.

Also very convenient that quiescent period is so well defined. How long
should we wait, a day?

> I think it may imply that there is some observed delay in writing the PEBS buffer.

Doesn't it explicitly state just that?

> So if perf records a precise hw event with a very small period, the slow PEBS
> writing may lock up the CPU.

And I still don't see how this would explain a lockup in the MSR writes.

[ Jiri, can you disable that stupid panic on hard lockup and let it run
for a while, see if all the lockup msgs hit the same IP? Also, can you
look where exactly that IP lives in the code? ]
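
For checking whether all the lockup messages hit the same IP, something like the following could tally them from a saved kernel log. This is only a rough sketch: it assumes each report contains a "RIP:" line that ends in a symbol+0xoffset/0xsize token, the exact line layout varies between kernel versions, and the function name tally_lockup_ips is made up for illustration. Keeping the box alive long enough to collect several reports should be possible by writing 0 to the kernel.hardlockup_panic sysctl, on kernels that expose it.

```python
import re
import sys
from collections import Counter

# Matches a "symbol+0xoffset/0xsize" token. Assumption: the RIP line of each
# lockup report carries one; the rest of the line differs across kernels.
SYM_RE = re.compile(r"([A-Za-z_][\w.]*)\+(0x[0-9a-f]+)/0x[0-9a-f]+")


def tally_lockup_ips(log_text):
    """Count how often each symbol+offset appears on a 'RIP:' line."""
    counts = Counter()
    for line in log_text.splitlines():
        # Only look at the text following the "RIP:" marker, if present.
        _, sep, rest = line.partition("RIP:")
        if not sep:
            continue
        match = SYM_RE.search(rest)
        if match:
            counts[f"{match.group(1)}+{match.group(2)}"] += 1
    return counts


if __name__ == "__main__":
    # Usage: dmesg | python3 tally_rip.py
    for ip, n in tally_lockup_ips(sys.stdin.read()).most_common():
        print(f"{n:6d}  {ip}")
```

If a single symbol+offset dominates the output, that is the place to look up in the code; a spread of different IPs would point away from one specific stuck instruction.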

So I suspect it actually just did the PERF_GLOBAL_CTRL write, how else
would the hardware watchdog trigger on that same CPU?

After that, there's only BTS muck, which you're not using, so WTH is it
actually stuck on?

> If so, I think disabling multiple PEBS should be a good approach.

As said, this should affect any and all PEBS event scheduling, not just
the multi stuff.