Message-ID: <20090612164815.GA30773@elte.hu>
Date:	Fri, 12 Jun 2009 18:48:15 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	"H. Peter Anvin" <hpa@...or.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Wu Fengguang <fengguang.wu@...el.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Nick Piggin <npiggin@...e.de>,
	Hugh Dickins <hugh.dickins@...cali.co.uk>,
	Andi Kleen <andi@...stfloor.org>,
	"riel@...hat.com" <riel@...hat.com>,
	"chris.mason@...cle.com" <chris.mason@...cle.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when
	feature is disabled

* H. Peter Anvin <hpa@...or.com> wrote:
> Ingo Molnar wrote:
> > 
> > So i think hwpoison simply does not affect our ability to get 
> > log messages out - but it sure allows crappier hardware to be 
> > used. Am i wrong about that for some reason?
> 
> Crappy hardware isn't the kind of hardware that is likely to have 
> the hwpoison features, just like crappy hardware generally doesn't 
> even have ECC -- or even basic parity checking (I personally think 
> non-ECC memory should be considered a crime against humanity in 
> this day and age.)
> 
> You're making the fundamental assumption that failover and 
> hardware replacement are relatively cheap and fast operations.  In 
> high reliability applications, of course, failover is always an 
> option -- it *HAS* to be an option -- but that doesn't mean that 
> hardware replacement is cheap, fast or even possible -- and now 
> you've blown your failover option.
> 
> These kinds of features are used when extremely high reliability 
> is required, think for example a telco core router.  A page error 
> may have happened due to stray radiation or through power supply 
> glitches (which happen even in the best of systems), but if they 
> are a pattern, a box needs to be replaced.  *How quickly* a box 
> can be taken out of service and replaced can vary greatly, and the 
> urgency depends on patterns; furthermore, in the meantime the 
> device has to work the best it can.
> 
> Consider, for example, a control computer on the Hubble Space 
> Telescope -- the only way to replace it is by space shuttle, and 
> you can safely guarantee that *that* won't happen in a heartbeat.  
> On the new Herschel Space Observatory, not even the space shuttle 
> can help: if the computers die, *or* if bad data gets fed to its 
> control system, the spacecraft is lost.  As such, it's of 
> paramount importance for the computers to (a) continue to provide 
> service at the level the hardware is capable of doing, (b) as 
> accurately as possible continually assess and report that level of 
> service, and (c) not allow a failure to pass undetected.  A lot of 
> failures are simple one-time events (especially in space, a 
> high-rad environment), others reflect decaying hardware but can be 
> isolated (e.g. a RAM cell which has developed a short circuit, or 
> a CPU core which has a damaged ALU), while others yet reflect a 
> general ill health of the system that cannot be recovered.
> 
> What these kinds of features do is give the overall-system 
> designers and the administrators more options.

Ok, these arguments are pretty convincing - thanks everyone for the
detailed explanation.

	Ingo