Message-ID: <20090612164815.GA30773@elte.hu>
Date: Fri, 12 Jun 2009 18:48:15 +0200
From: Ingo Molnar <mingo@...e.hu>
To: "H. Peter Anvin" <hpa@...or.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Wu Fengguang <fengguang.wu@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Andrew Morton <akpm@...ux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>,
Nick Piggin <npiggin@...e.de>,
Hugh Dickins <hugh.dickins@...cali.co.uk>,
Andi Kleen <andi@...stfloor.org>,
"riel@...hat.com" <riel@...hat.com>,
"chris.mason@...cle.com" <chris.mason@...cle.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
* H. Peter Anvin <hpa@...or.com> wrote:
> Ingo Molnar wrote:
> >
> > So i think hwpoison simply does not affect our ability to get
> > log messages out - but it sure allows crappier hardware to be
> > used. Am i wrong about that for some reason?
>
> Crappy hardware isn't the kind of hardware that is likely to have
> the hwpoison features, just like crappy hardware generally doesn't
> even have ECC -- or even basic parity checking (I personally think
> non-ECC memory should be considered a crime against humanity in
> this day and age.)
>
> You're making the fundamental assumption that failover and
> hardware replacement is a relatively cheap and fast operation. In
> high reliability applications, of course, failover is always an
> option -- it *HAS* to be an option -- but that doesn't mean that
> hardware replacement is cheap, fast or even possible -- and now
> you've blown your failover option.
>
> These kinds of features are used when extremely high reliability
> is required, think for example a telco core router. A page error
> may have happened due to stray radiation or through power supply
> glitches (which happen even in the best of systems), but if they
> form a pattern, the box needs to be replaced. *How quickly* a box
> can be taken out of service and replaced can vary greatly, and the
> urgency depends on those patterns; furthermore, in the meantime the
> device has to work the best it can.
>
> Consider, for example, a control computer on the Hubble Space
> Telescope -- the only way to replace it is by space shuttle, and
> you can safely guarantee that *that* won't happen in a heartbeat.
> On the new Herschel Space Observatory, not even the space shuttle
> can help: if the computers die, *or* if bad data gets fed to its
> control system, the spacecraft is lost. As such, it's of
> paramount importance for the computers to (a) continue to provide
> service at the level the hardware is capable of doing, (b) as
> accurately as possible continually assess and report that level of
> service, and (c) not allow a failure to pass undetected. A lot of
> failures are simple one-time events (especially in space, a
> high-rad environment), others reflect decaying hardware but can be
> isolated (e.g. a RAM cell which has developed a short circuit, or
> a CPU core which has a damaged ALU), while others yet reflect a
> general ill health of the system that cannot be recovered.
>
> What these kinds of features do is give the overall-system
> designers and the administrators more options.
Ok, these arguments are pretty convincing - thanks everyone for the
detailed explanation.
Ingo