Message-ID: <0207C53569FE594381A4F2EB66570B2A018EED1381@orsmsx508.amr.corp.intel.com>
Date:	Wed, 7 Dec 2011 09:08:00 -0800
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Ingo Molnar <mingo@...e.hu>, "Yu, Fenghua" <fenghua.yu@...el.com>
CC:	Borislav Petkov <bp@...64.org>,
	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Thomas Gleixner <tglx@...utronix.de>,
	H Peter Anvin <hpa@...or.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"Van De Ven, Arjan" <arjan.van.de.ven@...el.com>,
	"Siddha, Suresh B" <suresh.b.siddha@...el.com>,
	"Brown, Len" <len.brown@...el.com>,
	Randy Dunlap <rdunlap@...otime.net>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-pm <linux-pm@...r.kernel.org>, x86 <x86@...nel.org>,
	Tejun Heo <tj@...nel.org>,
	"Herrmann3, Andreas" <Andreas.Herrmann3@....com>
Subject: RE: [PATCH v4 0/7] x86: BSP or CPU0 online/offline

> More importantly, you generally *cannot* realistically continue 
> with a bad CPU anyway - the system will crash or will show signs 
> of corruptions and you *want* a full powerdown and a clean 
> reboot.

See the "Enhanced cache error reporting" section in the Intel
Software Developer's Manual (section 15.4 in volume 3B of the
latest edition).  In many cases Intel provides what is probably a
very early notification that a processor's cache is experiencing
problems.  At the time of the notification the system is still
functioning correctly.  The SDM suggests that when the "yellow"
status is signaled you should schedule service "within a few weeks".
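
As a rough illustration of what the "yellow" status looks like to
software, here is a minimal sketch that decodes the threshold-based
error status field from a raw IA32_MCi_STATUS value.  The bit positions
(VAL at bit 63, UC at bit 61, status field in bits 54:53) and the
encoding of the field are my reading of the SDM, not something stated
in this thread, so check them against your edition.  The function and
constant names are made up for the example.

```python
# Assumed layout of IA32_MCi_STATUS (verify against the SDM):
#   bit 63     VAL  register holds valid data
#   bit 61     UC   error was not corrected
#   bits 54:53 threshold-based error status, meaningful only for
#              corrected errors on CPUs that advertise the feature
MCI_STATUS_VAL = 1 << 63
MCI_STATUS_UC = 1 << 61
TES_SHIFT = 53
TES_MASK = 0x3

TES_NAMES = {0: "no tracking", 1: "green", 2: "yellow", 3: "reserved"}


def threshold_status(mci_status, tes_supported=True):
    """Return the threshold status name, or None if it does not apply.

    tes_supported stands in for the capability check (assumed to be a
    bit in IA32_MCG_CAP); the caller has to supply it.
    """
    if not tes_supported:
        return None
    if not mci_status & MCI_STATUS_VAL:
        return None
    if mci_status & MCI_STATUS_UC:
        # The field is only defined for corrected errors.
        return None
    return TES_NAMES[(mci_status >> TES_SHIFT) & TES_MASK]
```

A valid, corrected error with the field set to binary 10 would be
reported as "yellow" by this sketch.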

24x7 systems with a lot of sockets & cores, and highly paranoid
administrators, might want to stop using the cores that share the
problem cache before they are able to schedule downtime.
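
A minimal sketch of that action from user space, assuming the usual
Linux sysfs layout (cache/indexN/shared_cpu_list to find the cores
behind one cache, and the per-cpu "online" file to stop using them).
The helper names are hypothetical, and which cache index corresponds
to the failing cache is left to the caller.  A cpu without an "online"
file (typically cpu0 on kernels without the support this series adds)
is skipped.

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")


def parse_cpu_list(text):
    """Expand a sysfs cpu list such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cpus_sharing_cache(cpu, index):
    """Return the CPUs that share cache 'index' with 'cpu'."""
    path = SYSFS_CPU / f"cpu{cpu}" / "cache" / f"index{index}" / "shared_cpu_list"
    return parse_cpu_list(path.read_text())


def offline_cpus(cpus, dry_run=True):
    """Write 0 to each CPU's online file; needs root unless dry_run."""
    for cpu in cpus:
        control = SYSFS_CPU / f"cpu{cpu}" / "online"
        if dry_run:
            print(f"would write 0 to {control}")
        elif control.exists():
            control.write_text("0")
        else:
            print(f"cpu{cpu} has no online control file; skipping")
```

For example, offline_cpus(cpus_sharing_cache(4, 3)) would print what
it intends to do for every core sharing cache index 3 with cpu4;
passing dry_run=False as root performs the writes.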

>- Special hardware environments that are deeply redundant and 
>   can warn about 'soft' failures well before hard failures
>   which gives a realistic window of time for a maintenance
>   hot-swap. [Such hardware actually exists, i even worked with
>   an x86 one eons ago.]

So not so special any more - every Xeon since Core Duo has the
cache error reporting capability.

-Tony