Message-ID: <20090615162804.4cb75b30@lxorguk.ukuu.org.uk>
Date: Mon, 15 Jun 2009 16:28:04 +0100
From: Alan Cox <alan@...rguk.ukuu.org.uk>
To: Andi Kleen <andi@...stfloor.org>
Cc: Andi Kleen <andi@...stfloor.org>,
Hugh Dickins <hugh.dickins@...cali.co.uk>,
Wu Fengguang <fengguang.wu@...el.com>,
Balbir Singh <balbir@...ux.vnet.ibm.com>,
Andrew Morton <akpm@...ux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...e.hu>,
Mel Gorman <mel@....ul.ie>,
Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Nick Piggin <npiggin@...e.de>,
"riel@...hat.com" <riel@...hat.com>,
"chris.mason@...cle.com" <chris.mason@...cle.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)
> oops=panic already implies panic on all machine check exceptions, so they will
> be fine then (assuming this is the best strategy for availability
> for them, which I personally find quite doubtful, but we can discuss this some
> other time)
You can have the argument with all the people who deploy large systems.
Providing their boxes can be persuaded to panic, they don't care about the
masses.
> That's because unpoisoning is quite hard -- you need some kind
> of synchronization point for all the error handling and that's
> the poisoned page and if it unpoisons itself then you need
> some very heavyweight synchronization to avoid handling errors
> multiple times. I looked at it, but it's quite messy.
>
> Also it's of somewhat dubious value.
On a system running under a hypervisor or with hot swappable memory it's
of rather higher value. In the hypervisor case the guest system can
acquire a new virtual page to replace the faulty one. In fact the
hypervisor case is even more complex as the guest may get migrated, at
which point knowledge of "poisoned" memory is not at all connected to
information on hardware failings.
> >
> > (You can unfail pages on x86 as well it appears by scrubbing them via DMA
> > - yes ?)
>
> Not architecturally. Also the other problem is not just unpoisoning them,
> but finding out if the page is permanently bad or just temporarily.
Small detail you are overlooking: Hot swap mirrorable memory.
Second detail you are overlooking:
	curse a lot
	suspend to disk
	remove dirt from fans, clean/replace RAM
	resume from disk
The very act of making the ECC error not take out the box creates the
environment whereby the underlying hardware error (if there was one) can
be cured.
In all these cases the fact you've got to shoot stuff because a page has
been lost becomes totally disconnected from the idea that the page is
somehow not recoverable and "contaminated" forever.
Which to me says your model is wrong.
Alan