Message-ID: <4A327CB1.6060009@redhat.com>
Date: Fri, 12 Jun 2009 12:05:05 -0400
From: Rik van Riel <riel@...hat.com>
To: Ingo Molnar <mingo@...e.hu>
CC: Linus Torvalds <torvalds@...ux-foundation.org>,
Wu Fengguang <fengguang.wu@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Andrew Morton <akpm@...ux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>,
Nick Piggin <npiggin@...e.de>,
Hugh Dickins <hugh.dickins@...cali.co.uk>,
Andi Kleen <andi@...stfloor.org>,
"chris.mason@...cle.com" <chris.mason@...cle.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature
is disabled

Ingo Molnar wrote:

> So i think hwpoison simply does not affect our ability to get log
> messages out - but it sure allows crappier hardware to be used.
> Am i wrong about that for some reason?

You are :)

A 2-bit memory error can be a temporary failure, e.g.
due to a cosmic ray. If bit errors could be prevented
in hardware, there would be no reason to have ECC at all.

The only reason to stop using that page is that we
do not know for sure whether the error was temporary
or permanent (or dependent on a particular bit pattern).

Userspace needs to be notified that some data disappeared,
if it did. For clean pagecache and swap cache pages, the
kernel can simply take the page away and wait for a page
fault...

The sysadmin needs to know that something happened too,
because the hardware *might* have a problem.

However, a 2-bit error does not imply that the hardware
actually needs to be replaced.
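
As a rough illustration of the policy described above, here is a
minimal sketch in C. It is not kernel code: the struct, the function
and the single boolean input are all hypothetical names made up for
this example, and the real decision depends on far more page state.

```c
#include <stdbool.h>

/* Hypothetical summary of what to do after an uncorrectable error. */
struct poison_response {
    bool retire_page;       /* stop using the physical page */
    bool notify_userspace;  /* tell the owner that data disappeared */
    bool log_for_sysadmin;  /* the hardware *might* have a problem */
};

/*
 * clean_copy_exists: true for clean pagecache or swap cache pages,
 * where the contents can be read back in on the next page fault.
 */
struct poison_response respond_to_poison(bool clean_copy_exists)
{
    struct poison_response r;

    /* We cannot tell a temporary error from a permanent one. */
    r.retire_page = true;

    /* The event is always worth reporting, even if nothing is replaced. */
    r.log_for_sysadmin = true;

    /* Only notify userspace when data was actually lost. */
    r.notify_userspace = !clean_copy_exists;

    return r;
}
```

In this sketch a clean page is simply dropped without bothering its
owner, while a page with no backing copy also triggers a notification.
Both cases retire the page and leave a record for the sysadmin.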
--
All rights reversed.