Message-ID: <0207C53569FE594381A4F2EB66570B2A018EED1381@orsmsx508.amr.corp.intel.com>
Date:	Wed, 7 Dec 2011 09:08:00 -0800
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Ingo Molnar <mingo@...e.hu>, "Yu, Fenghua" <fenghua.yu@...el.com>
CC:	Borislav Petkov <bp@...64.org>,
	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Thomas Gleixner <tglx@...utronix.de>,
	H Peter Anvin <hpa@...or.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"Van De Ven, Arjan" <arjan.van.de.ven@...el.com>,
	"Siddha, Suresh B" <suresh.b.siddha@...el.com>,
	"Brown, Len" <len.brown@...el.com>,
	Randy Dunlap <rdunlap@...otime.net>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-pm <linux-pm@...r.kernel.org>, x86 <x86@...nel.org>,
	Tejun Heo <tj@...nel.org>,
	"Herrmann3, Andreas" <Andreas.Herrmann3@....com>
Subject: RE: [PATCH v4 0/7] x86: BSP or CPU0 online/offline

> More importantly, you generally *cannot* realistically continue 
> with a bad CPU anyway - the system will crash or will show signs 
> of corruptions and you *want* a full powerdown and a clean 
> reboot.

See the "Enhanced cache error reporting" section in the Intel
Software Developer's Manual (section 15.4 in volume 3B of the
latest edition).  Intel provides what is probably a very early
notification in many cases that a processor's cache is experiencing
problems. At the time of the notification the system is still
functioning correctly.  The SDM suggests that when the "yellow"
status is signaled you should schedule service "within a few weeks".
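
For anyone who wants to see what checking for that status looks like,
here is a minimal sketch in C. It assumes, from my reading of the SDM,
that the threshold-based error status sits in bits 54:53 of
IA32_MCi_STATUS, that it is only meaningful when MCG_TES_P (bit 11 of
IA32_MCG_CAP) is set, and that it applies to valid, corrected errors
(VAL set, UC clear). The macro and function names are hypothetical, not
taken from the kernel; check the bit positions against the SDM before
relying on them.

```c
#include <stdint.h>

/* Bit positions are assumptions based on the SDM; names are hypothetical. */
#define MCG_CAP_TES_P        (1ULL << 11)  /* IA32_MCG_CAP: TES field present */
#define MCI_STATUS_VAL       (1ULL << 63)  /* bank holds a valid error */
#define MCI_STATUS_UC        (1ULL << 61)  /* error was uncorrected */
#define MCI_STATUS_TES_SHIFT 53            /* threshold-based error status */
#define MCI_STATUS_TES_MASK  0x3ULL

enum tes_state {
    TES_NONE     = 0,  /* no tracking, or field not applicable */
    TES_GREEN    = 1,  /* corrected errors below the threshold */
    TES_YELLOW   = 2,  /* corrected errors above the threshold */
    TES_RESERVED = 3
};

/*
 * Decode the threshold-based error status from raw register values.
 * Returns TES_NONE whenever the field cannot be trusted: the CPU does
 * not advertise it, the bank is not valid, or the error is uncorrected.
 */
static enum tes_state tes_decode(uint64_t mcg_cap, uint64_t mci_status)
{
    if (!(mcg_cap & MCG_CAP_TES_P))
        return TES_NONE;
    if (!(mci_status & MCI_STATUS_VAL))
        return TES_NONE;
    if (mci_status & MCI_STATUS_UC)
        return TES_NONE;

    return (enum tes_state)((mci_status >> MCI_STATUS_TES_SHIFT) &
                            MCI_STATUS_TES_MASK);
}
```

A caller would pass in the values read from IA32_MCG_CAP and from the
bank's IA32_MCi_STATUS, and treat TES_YELLOW as the cue to schedule
service or to stop using the cores that share that cache.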

24x7 systems with a lot of sockets & cores, and highly paranoid
administrators, might want to stop using the cores that share the
problem cache before they are able to schedule downtime.

>- Special hardware environments that are deeply redundant and 
>   can warn about 'soft' failures well before hard failures
>   which gives a realistic window of time for a maintenance
>   hot-swap. [Such hardware actually exists, i even worked with
>   an x86 one eons ago.]

So not so special any more - every Xeon since Core Duo has the
cache error reporting capability.

-Tony