linux-kernel - Re: [PATCH v4 0/7] x86: BSP or CPU0 online/offline

Date:	Wed, 7 Dec 2011 23:21:51 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	"Luck, Tony" <tony.luck@...el.com>
Cc:	"Yu, Fenghua" <fenghua.yu@...el.com>,
	Borislav Petkov <bp@...64.org>,
	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Thomas Gleixner <tglx@...utronix.de>,
	H Peter Anvin <hpa@...or.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"Van De Ven, Arjan" <arjan.van.de.ven@...el.com>,
	"Siddha, Suresh B" <suresh.b.siddha@...el.com>,
	"Brown, Len" <len.brown@...el.com>,
	Randy Dunlap <rdunlap@...otime.net>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-pm <linux-pm@...r.kernel.org>, x86 <x86@...nel.org>,
	Tejun Heo <tj@...nel.org>,
	"Herrmann3, Andreas" <Andreas.Herrmann3@....com>
Subject: Re: [PATCH v4 0/7] x86: BSP or CPU0 online/offline


* Luck, Tony <tony.luck@...el.com> wrote:

> > More importantly, you generally *cannot* realistically 
> > continue with a bad CPU anyway - the system will crash or 
> > will show signs of corruptions and you *want* a full 
> > powerdown and a clean reboot.
> 
> See the "Enhanced cache error reporting" section in the Intel 
> Software Developers manual (section 15.4 in volume 3B of the 
> latest edition).  Intel provides what is probably a very early 
> notification in many cases that a processor's cache is 
> experiencing problems. At the time of the notification the 
> system is still functioning correctly.  The SDM suggests that 
> when the "yellow" status is signaled you should schedule 
> service "within a few weeks".

The question is, how reliably does this predict real CPU 
trouble, statistically? The on-die cache might have the highest 
transistor count, but it's not under nearly the same thermal 
stress as the functional units.

If 90% of all hard CPU failures can be predicted that way then 
it's probably useful. If it's only 20%, then not so much.

Also, it's still all theoretical until there are systems out 
there where the CPU socket is physically hotpluggable. If there 
are such plans in the works then sure, theory becomes reality and 
then it's all useful - and then we can do these patches (and more).

Thanks,

	Ingo