linux-kernel - Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled


Date:	Fri, 12 Jun 2009 18:48:15 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	"H. Peter Anvin" <hpa@...or.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Wu Fengguang <fengguang.wu@...el.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Nick Piggin <npiggin@...e.de>,
	Hugh Dickins <hugh.dickins@...cali.co.uk>,
	Andi Kleen <andi@...stfloor.org>,
	"riel@...hat.com" <riel@...hat.com>,
	"chris.mason@...cle.com" <chris.mason@...cle.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when
	feature is disabled


* H. Peter Anvin <hpa@...or.com> wrote:

> Ingo Molnar wrote:
> > 
> > So i think hwpoison simply does not affect our ability to get 
> > log messages out - but it sure allows crappier hardware to be 
> > used. Am i wrong about that for some reason?
> 
> Crappy hardware isn't the kind of hardware that is likely to have 
> the hwpoison features, just like crappy hardware generally doesn't 
> even have ECC -- or even basic parity checking (I personally think 
> non-ECC memory should be considered a crime against humanity in 
> this day and age.)
> 
> You're making the fundamental assumption that failover and 
> hardware replacement are relatively cheap and fast operations.  In 
> high reliability applications, of course, failover is always an 
> option -- it *HAS* to be an option -- but that doesn't mean that 
> hardware replacement is cheap, fast or even possible -- and now 
> you've blown your failover option.
> 
> These kinds of features are used when extremely high reliability 
> is required, think for example a telco core router.  A page error 
> may have happened due to stray radiation or through power supply 
> glitches (which happen even in the best of systems), but if they 
> are a pattern, a box needs to be replaced.  *How quickly* a box 
> can be taken out of service and replaced can vary greatly, and its 
> urgency depends on patterns; furthermore, in the meantime the 
> device has to work the best it can.
> 
> Consider, for example, a control computer on the Hubble Space 
> Telescope -- the only way to replace it is by space shuttle, and 
> you can safely guarantee that *that* won't happen in a heartbeat.  
> On the new Herschel Space Observatory, not even the space shuttle 
> can help: if the computers die, *or* if bad data gets fed to its 
> control system, the spacecraft is lost.  As such, it's of 
> paramount importance for the computers to (a) continue to provide 
> service at the level the hardware is capable of doing, (b) as 
> accurately as possible continually assess and report that level of 
> service, and (c) not allow a failure to pass undetected.  A lot of 
> failures are simple one-time events (especially in space, a 
> high-rad environment), others reflect decaying hardware but can be 
> isolated (e.g. a RAM cell which has developed a short circuit, or 
> a CPU core which has a damaged ALU), while others yet reflect a 
> general ill health of the system that cannot be recovered.
> 
> What these kinds of features do is give the overall-system 
> designers and the administrators more options.

Ok, these arguments are pretty convincing - thanks everyone for the
detailed explanation.

	Ingo