Message-ID: <20090616202726.GB31443@sgi.com>
Date: Tue, 16 Jun 2009 15:27:26 -0500
From: Russ Anderson <rja@....com>
To: Nick Piggin <npiggin@...e.de>
Cc: Ingo Molnar <mingo@...e.hu>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Wu Fengguang <fengguang.wu@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Andrew Morton <akpm@...ux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>,
Hugh Dickins <hugh.dickins@...cali.co.uk>,
Andi Kleen <andi@...stfloor.org>,
"riel@...hat.com" <riel@...hat.com>,
"chris.mason@...cle.com" <chris.mason@...cle.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, rja@....com
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
On Mon, Jun 15, 2009 at 08:52:32AM +0200, Nick Piggin wrote:
> On Fri, Jun 12, 2009 at 05:35:01PM +0200, Ingo Molnar wrote:
> > * Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > > On Fri, 12 Jun 2009, Ingo Molnar wrote:
> > > >
> > > > This seems like trying to handle a failure mode that cannot be
> > > > and shouldnt be 'handled' really. If there's an 'already
> > > > corrupted' page then the box should go down hard and fast, and
> > > > we should not risk _even more user data corruption_ by trying to
> > > > 'continue' in the hope of having hit some 'harmless' user
> > > > process that can be killed ...
> > >
> > > No, the box should _not_ go down hard-and-fast. That's the last
> > > thing we should *ever* do.
> > >
> > > We need to log it. Often at a user level (ie we want to make sure
> > > it actually hits syslog, possibly goes out the network, maybe pops
> > > up a window, whatever).
> > >
> > > Shutting down the machine is the last thing we ever want to do.
> > >
> > > The whole "let's panic" mentality is a disease.
> >
> > No doubt about that - and i'm removing BUG_ON()s and panic()s
> > wherever i can and havent added a single new one myself in the past
> > 5 years or so, its a disease.
>
> In HA failover systems you often do want to panic ASAP (after logging
> to serial console I guess) if anything like this happens so the system
> can be rebooted with minimal chance of data corruption spreading.
The whole point of hardware data poisoning is to avoid having to
panic the system due to the potential of undetected data corruption,
because the corrupt data is always marked bad. This has worked
well on ia64, where applications that encounter bad data are killed
and the memory is poisoned and not reallocated, avoiding a system panic.
This has been used at customer sites for a few years, by the type
of customers that really check their data. It is nice to see
the hardware poison feature moving to the x86 "mainstream".
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@....com