Message-ID: <20090616202726.GB31443@sgi.com>
Date:	Tue, 16 Jun 2009 15:27:26 -0500
From:	Russ Anderson <rja@....com>
To:	Nick Piggin <npiggin@...e.de>
Cc:	Ingo Molnar <mingo@...e.hu>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Wu Fengguang <fengguang.wu@...el.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Hugh Dickins <hugh.dickins@...cali.co.uk>,
	Andi Kleen <andi@...stfloor.org>,
	"riel@...hat.com" <riel@...hat.com>,
	"chris.mason@...cle.com" <chris.mason@...cle.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, rja@....com
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled

On Mon, Jun 15, 2009 at 08:52:32AM +0200, Nick Piggin wrote:
> On Fri, Jun 12, 2009 at 05:35:01PM +0200, Ingo Molnar wrote:
> > * Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > > On Fri, 12 Jun 2009, Ingo Molnar wrote:
> > > > 
> > > > This seems like trying to handle a failure mode that cannot be 
> > > > and shouldn't be 'handled' really. If there's an 'already 
> > > > corrupted' page then the box should go down hard and fast, and 
> > > > we should not risk _even more user data corruption_ by trying to 
> > > > 'continue' in the hope of having hit some 'harmless' user 
> > > > process that can be killed ...
> > > 
> > > No, the box should _not_ go down hard-and-fast. That's the last 
> > > thing we should *ever* do.
> > > 
> > > We need to log it. Often at a user level (i.e. we want to make sure 
> > > it actually hits syslog, possibly goes out the network, maybe pops 
> > > up a window, whatever).
> > > 
> > > Shutting down the machine is the last thing we ever want to do.
> > > 
> > > The whole "let's panic" mentality is a disease.
> > 
> > No doubt about that - and I'm removing BUG_ON()s and panic()s 
> > wherever I can and haven't added a single new one myself in the past 
> > 5 years or so; it's a disease.
> 
> In HA failover systems you often do want to panic ASAP (after logging
> to serial console I guess) if anything like this happens so the system
> can be rebooted with minimal chance of data corruption spreading.

The whole point of hardware data poisoning is to avoid having to 
panic the system because of potential undetected data corruption:
the corrupt data is always marked bad.  This has worked well on
ia64, where applications that encounter bad data are killed and the
memory is poisoned and never reallocated, avoiding a system panic.
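The recovery policy described above (kill only the consumer of bad data, and retire the bad memory instead of panicking) can be illustrated with a toy allocator. This is a hypothetical Python sketch, not the kernel's or ia64's actual code; the class name `ToyMemory` and all of its methods are invented for illustration, and page frames are modeled as plain integers.

```python
class ToyMemory:
    """Hypothetical toy model of poison-based recovery (not kernel code).

    A frame reported bad by the hardware goes into `poisoned`; it is never
    handed out again, and only the process that touches it is killed.
    """

    def __init__(self, num_pages: int) -> None:
        self.free: list[int] = list(range(num_pages))  # allocatable frames
        self.poisoned: set[int] = set()                # frames known to be bad
        self.owner: dict[int, int] = {}                # frame -> owning pid
        self.alive: set[int] = set()                   # pids still running

    def alloc(self, pid: int) -> int:
        page = self.free.pop(0)
        self.owner[page] = pid
        self.alive.add(pid)
        return page

    def mark_poisoned(self, page: int) -> None:
        # Hardware reported an uncorrectable error on this frame.
        self.poisoned.add(page)
        if page in self.free:
            # An unused frame is simply withdrawn; nobody needs to die.
            self.free.remove(page)

    def access(self, pid: int, page: int) -> bool:
        """Return True on a normal access; kill `pid` if the page is bad."""
        if page in self.poisoned:
            self._kill(pid)
            return False
        return True

    def _kill(self, pid: int) -> None:
        # Tear down only the consuming process instead of the whole system.
        self.alive.discard(pid)
        for page, p in list(self.owner.items()):
            if p == pid:
                del self.owner[page]
                if page not in self.poisoned:
                    self.free.append(page)  # good frames become reusable
                # poisoned frames are dropped here: never reallocated
```

In this model, a process that reads a poisoned frame is killed and its healthy frames return to the free list, while every other process keeps running and the bad frame is never handed out again.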

This has been used at customer sites for a few years, by the type of
customers that really check their data.  It is nice to see
the hardware poison feature moving to the x86 "mainstream".



-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@....com