Message-ID: <20111207222151.GB18356@elte.hu>
Date:	Wed, 7 Dec 2011 23:21:51 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	"Luck, Tony" <tony.luck@...el.com>
Cc:	"Yu, Fenghua" <fenghua.yu@...el.com>,
	Borislav Petkov <bp@...64.org>,
	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Thomas Gleixner <tglx@...utronix.de>,
	H Peter Anvin <hpa@...or.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"Van De Ven, Arjan" <arjan.van.de.ven@...el.com>,
	"Siddha, Suresh B" <suresh.b.siddha@...el.com>,
	"Brown, Len" <len.brown@...el.com>,
	Randy Dunlap <rdunlap@...otime.net>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-pm <linux-pm@...r.kernel.org>, x86 <x86@...nel.org>,
	Tejun Heo <tj@...nel.org>,
	"Herrmann3, Andreas" <Andreas.Herrmann3@....com>
Subject: Re: [PATCH v4 0/7] x86: BSP or CPU0 online/offline


* Luck, Tony <tony.luck@...el.com> wrote:

> > More importantly, you generally *cannot* realistically 
> > continue with a bad CPU anyway - the system will crash or 
> > will show signs of corruption and you *want* a full 
> > powerdown and a clean reboot.
> 
> See the "Enhanced cache error reporting" section in the Intel 
> Software Developers manual (section 15.4 in volume 3B of the 
> latest edition).  Intel provides what is probably a very early 
> notification in many cases that a processor's cache is 
> experiencing problems. At the time of the notification the 
> system is still functioning correctly.  The SDM suggests that 
> when the "yellow" status is signaled you should schedule 
> service "within a few weeks".

The question is, how reliably does this predict real CPU 
trouble, statistically? The on-die cache might have the highest 
transistor count, but it's not under nearly the same thermal 
stress as the functional units.

If 90% of all hard CPU failures can be predicted that way then 
it's probably useful. If it's only 20%, then not so much.

Also, it's still all theoretical until there are systems out 
there where the CPU socket is physically hotpluggable. If there 
are such plans in the works then sure, theory becomes reality 
and then it's all useful - and then we can do these patches (and 
more).

Thanks,

	Ingo