linux-kernel - Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled


Message-ID: <20090616202726.GB31443@sgi.com>
Date:	Tue, 16 Jun 2009 15:27:26 -0500
From:	Russ Anderson <rja@....com>
To:	Nick Piggin <npiggin@...e.de>
Cc:	Ingo Molnar <mingo@...e.hu>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Wu Fengguang <fengguang.wu@...el.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Hugh Dickins <hugh.dickins@...cali.co.uk>,
	Andi Kleen <andi@...stfloor.org>,
	"riel@...hat.com" <riel@...hat.com>,
	"chris.mason@...cle.com" <chris.mason@...cle.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, rja@....com
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled

On Mon, Jun 15, 2009 at 08:52:32AM +0200, Nick Piggin wrote:
> On Fri, Jun 12, 2009 at 05:35:01PM +0200, Ingo Molnar wrote:
> > * Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > > On Fri, 12 Jun 2009, Ingo Molnar wrote:
> > > > 
> > > > This seems like trying to handle a failure mode that cannot be 
> > > > and shouldn't be 'handled' really. If there's an 'already 
> > > > corrupted' page then the box should go down hard and fast, and 
> > > > we should not risk _even more user data corruption_ by trying to 
> > > > 'continue' in the hope of having hit some 'harmless' user 
> > > > process that can be killed ...
> > > 
> > > No, the box should _not_ go down hard-and-fast. That's the last 
> > > thing we should *ever* do.
> > > 
> > > We need to log it. Often at a user level (ie we want to make sure 
> > > it actually hits syslog, possibly goes out the network, maybe pops 
> > > up a window, whatever).
> > > 
> > > Shutting down the machine is the last thing we ever want to do.
> > > 
> > > The whole "let's panic" mentality is a disease.
> > 
> > No doubt about that - and i'm removing BUG_ON()s and panic()s 
> > wherever i can and haven't added a single new one myself in the past 
> > 5 years or so, it's a disease.
> 
> In HA failover systems you often do want to panic ASAP (after logging
> to serial console I guess) if anything like this happens so the system
> can be rebooted with minimal chance of data corruption spreading.

The whole point of hardware data poisoning is to avoid having to 
panic the system due to the potential of undetected data corruption,
because the corrupt data is always marked bad.  This has worked
well on ia64 where applications that encounter bad data are killed
and the memory poisoned and not reallocated, avoiding a system panic.
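
As a rough illustration of that policy, here is a toy user-space model in C.
It is only a sketch under stated assumptions: it is not kernel or ia64 code,
and every name in it (toy_page, toy_alloc, toy_memory_failure) is
hypothetical.  It shows a page being marked bad on a reported error, only the
owning process being "killed", and the allocator never handing that page out
again.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 4

/* Hypothetical per-page state; this is not the kernel's struct page. */
struct toy_page {
    bool poisoned;   /* an uncorrectable error was reported for this page */
    int  owner_pid;  /* 0 means the page is free */
};

static struct toy_page pages[NPAGES];

/* Hand out the first free page that has never been poisoned.
 * Returns the page index, or -1 if no usable page is left. */
static int toy_alloc(int pid)
{
    for (int i = 0; i < NPAGES; i++) {
        if (!pages[i].poisoned && pages[i].owner_pid == 0) {
            pages[i].owner_pid = pid;
            return i;
        }
    }
    return -1;
}

/* React to a reported memory error on page idx: mark the page bad for
 * good and detach it from its owner.  Returns the pid that would be
 * killed (0 if the page was free).  The rest of the system keeps
 * running; nothing here panics. */
static int toy_memory_failure(int idx)
{
    assert(idx >= 0 && idx < NPAGES);
    int victim = pages[idx].owner_pid;
    pages[idx].poisoned = true;
    pages[idx].owner_pid = 0;
    return victim;
}
```

In this model, if pid 100 owns page 0 and an error is reported on it,
toy_memory_failure(0) returns 100 and a later toy_alloc() skips page 0, even
though it no longer has an owner.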

This has been used at customer sites for a few years.  The type
of customers that really check their data.  It is nice to see
the hardware poison feature moving to the x86 "mainstream".



-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@....com