Message-ID: <20150508075347.GB5403@gmail.com>
Date:	Fri, 8 May 2015 09:53:47 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Vince Weaver <vincent.weaver@...ne.edu>
Cc:	linux-kernel@...r.kernel.org,
	Peter Zijlstra <peterz@...radead.org>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Jiri Olsa <jolsa@...hat.com>, Ingo Molnar <mingo@...hat.com>,
	Paul Mackerras <paulus@...ba.org>
Subject: Re: perf: WARNING perfevents: irq loop stuck!


* Vince Weaver <vincent.weaver@...ne.edu> wrote:

> On Fri, 1 May 2015, Ingo Molnar wrote:
> 
> > So 0000fffffffffffe corresponds to 2 events left until overflow, 
> > right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we 
> > allow these super short periods.
> > 
> > Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well, 
> > one similar to bdw_limit_period()? Something like the patch below?
> > 
> > Totally untested and such. I picked 128 because of Broadwell, but 
> > lower values might work as well. You could try to increase it to 3 and 
> > upwards and see which one stops triggering stuck NMI loops?
> 
> I spent a lot of time trying to come up with a test case that 
> triggered this more reliably but failed.
> 
> It definitely is an issue with PMC0 being -2 causing the PMC0 bit in 
> the status register to get stuck and not clear.  Often there is 
> also a PEBS event active at the same time but that might be 
> coincidence.
> 
> With your patch applied I can't trigger the issue. I haven't tried 
> narrowing down the exact value yet.

So how about I change it from 128U to 2U and apply it upstream?

I.e. use the minimal threshold that we have observed to cause 
problems. That way, should the problem ever show up in different 
circumstances, we'll eventually reproduce it or hear about it.
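
To make the arithmetic concrete, here is a minimal sketch of how a
remaining period maps to the programmed counter value, and how a
minimum-period clamp changes it. It is not the patch discussed above
(which is not quoted in this message). The 48-bit counter width is an
assumption inferred from the value 0000fffffffffffe, and the names
counter_start_value() and clamp_period() are hypothetical, not kernel
functions.

```c
#include <stdint.h>

/* Assumption: 48-bit counters, inferred from the value 0000fffffffffffe. */
#define COUNTER_BITS 48
#define COUNTER_MASK ((UINT64_C(1) << COUNTER_BITS) - 1)

/*
 * Hypothetical helper: the value to program so that the counter
 * overflows after `left` more events, i.e. -left truncated to the
 * counter width.
 */
static uint64_t counter_start_value(uint64_t left)
{
	return (UINT64_C(0) - left) & COUNTER_MASK;
}

/*
 * Hypothetical clamp in the spirit of a limit_period quirk: never let
 * the remaining period drop below `min_left`.
 */
static uint64_t clamp_period(uint64_t left, uint64_t min_left)
{
	return left < min_left ? min_left : left;
}
```

With these definitions, counter_start_value(2) yields
0x0000fffffffffffe, the value seen in the report, while
counter_start_value(clamp_period(2, 128)) yields 0x0000ffffffffff80,
so the counter starts further away from overflow.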

Thanks,

	Ingo