Message-ID: <20090408062107.GE17934@one.firstfloor.org>
Date: Wed, 8 Apr 2009 08:21:07 +0200
From: Andi Kleen <andi@...stfloor.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Andi Kleen <andi@...stfloor.org>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org
Subject: Re: [PATCH] [0/16] POISON: Intro
On Tue, Apr 07, 2009 at 10:47:09PM -0700, Andrew Morton wrote:
> On Tue, 7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <andi@...stfloor.org> wrote:
>
> > Upcoming Intel CPUs have support for recovering from some memory errors. This
> > requires the OS to declare a page "poisoned", kill the processes associated
> > with it and avoid using it in the future. This patchkit implements
> > the necessary infrastructure in the VM.
>
> Seems that this feature is crying out for a testing framework (perhaps
> it already has one?).
Several, in fact. Two of them:
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
(a test suite covering various cases)
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
(an injector using the x86-specific error injection hooks I posted
earlier)
Then I have some tests using the madvise MADV_POISON hook
(which test the various cases from a process's standpoint,
including recovery). This is still a little hackish, but if there's
interest I can put it out. It has at least one test case
that is known to hang (nonlinear mappings); I'm still looking
at that.
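For illustration only, here is a minimal sketch of what a process-level
test of this kind can look like. It is not the test code mentioned
above. It assumes the madvise hook is reachable under the name
MADV_HWPOISON (value 100 in later Linux UAPI headers; MADV_POISON is the
name used in this thread), and the helper names classify() and
run_case() are hypothetical. A child maps one anonymous page, applies
the advice, and touches the page again; the parent only looks at how
the child ended.

```python
import mmap
import os
import signal

# Assumption: later kernels expose the hook as MADV_HWPOISON (100).
MADV_HWPOISON = getattr(mmap, "MADV_HWPOISON", 100)


def classify(status):
    """Map a waitpid() status to a coarse verdict string."""
    if os.WIFSIGNALED(status):
        if os.WTERMSIG(status) == signal.SIGBUS:
            return "sigbus"
        return "other-signal"
    code = os.WEXITSTATUS(status)
    return {0: "survived", 2: "skipped"}.get(code, "error")


def run_case(advice):
    """Fork a child that applies `advice` to one page, then touches it."""
    pid = os.fork()
    if pid == 0:
        try:
            page = mmap.mmap(-1, mmap.PAGESIZE,
                             flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
            page[0] = 1      # fault the page in before advising on it
            try:
                page.madvise(advice)
            except OSError:
                os._exit(2)  # e.g. missing privilege or kernel support
            _ = page[0]      # touching a poisoned page should raise SIGBUS
            os._exit(0)
        except BaseException:
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return classify(status)
```

Calling run_case(MADV_HWPOISON) as root on a kernel with the feature
enabled takes a real page out of service until reboot, so it belongs on
a test machine. With a harmless advice such as mmap.MADV_NORMAL the
child simply survives, which is a cheap way to check the harness itself.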
The long-term plan is to put both mce-test above and the
MADV_POISON test into LTP.
There are also a few random hacks. Coverage is still not 100%, though.
> A simplistic approach would be
Random kill anywhere is hard to test because your system will
die regularly and randomly. mce-test.git does some automated
testing of fatal errors by catching them using kexec, but we haven't
tried that for full recovery.
>
> echo some-pfn > /proc/bad-pfn-goes-here
>
> A slightly more sophisticated version might do the deed from within a
> timer interrupt, just to get a bit more coverage.
mce-test/inject does it from other CPUs with smp_call_function_single,
so it's really relatively random. I've considered using NMIs too,
but at least the high-level recovery code synchronizes to work queue
context first anyway, so it doesn't buy us too much for that.
-Andi
--
ak@...ux.intel.com -- Speaking for myself only.