Message-ID: <20090612153501.GA5737@elte.hu>
Date: Fri, 12 Jun 2009 17:35:01 +0200
From: Ingo Molnar <mingo@...e.hu>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Andrew Morton <akpm@...ux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>,
Nick Piggin <npiggin@...e.de>,
Hugh Dickins <hugh.dickins@...cali.co.uk>,
Andi Kleen <andi@...stfloor.org>,
"riel@...hat.com" <riel@...hat.com>,
"chris.mason@...cle.com" <chris.mason@...cle.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when
feature is disabled
* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> On Fri, 12 Jun 2009, Ingo Molnar wrote:
> >
> > This seems like trying to handle a failure mode that cannot be
> > and shouldn't be 'handled' really. If there's an 'already
> > corrupted' page then the box should go down hard and fast, and
> > we should not risk _even more user data corruption_ by trying to
> > 'continue' in the hope of having hit some 'harmless' user
> > process that can be killed ...
>
> No, the box should _not_ go down hard-and-fast. That's the last
> thing we should *ever* do.
>
> We need to log it. Often at a user level (ie we want to make sure
> it actually hits syslog, possibly goes out the network, maybe pops
> up a window, whatever).
>
> Shutting down the machine is the last thing we ever want to do.
>
> The whole "let's panic" mentality is a disease.
No doubt about that - and i'm removing BUG_ON()s and panic()s
wherever i can and haven't added a single new one myself in the past
5 years or so. It's a disease.
If a fault hits a harmless piece of the system, then the log message
will make it out and people know what happened. hwpoison does not
affect that at all. If the fault hits the critical path towards
getting the log message out - then we won't get a log message,
hwpoison or not.
My point is that hwpoison allows the _ignoring_ of hardware problems
and thus pushes more buggy hardware up the pipeline.
Clusters will be running with this under the (false IMO) assumption
that the kernel will tell the admin when something bad happened and
the machine can limp along otherwise.
So i think hwpoison simply does not affect our ability to get log
messages out - but it sure allows crappier hardware to be used.
Am i wrong about that for some reason?
Ingo