Message-ID: <1401296015.4361.12.camel@oc3432500282.ibm.com>
Date: Wed, 28 May 2014 09:53:35 -0700
From: Max Asbock <masbock@...ux.vnet.ibm.com>
To: Chen Yucong <slaoub@...il.com>
Cc: Borislav Petkov <bp@...en8.de>,
LKML <linux-kernel@...r.kernel.org>,
linux-edac <linux-edac@...r.kernel.org>, X86 ML <x86@...nel.org>,
Tony Luck <tony.luck@...el.com>
Subject: Re: [RFC PATCH 0/3] RAS: Correctable Errors Collector thing

On Wed, 2014-05-28 at 10:49 +0800, Chen Yucong wrote:
> > From: Borislav Petkov <bp@...e.de>
> >
> > Hi all,
> >
> > this is something Tony and I have been working on behind the curtains
> > recently. Here it is in RFC form; it passes quick testing in kvm. Let
> > me send it out before I start hammering on it on a real machine.
> >
> > More in-depth info about what it is and what it does is in patch 1/3.
> >
> > As always, comments and suggestions are most welcome.
> >
> > Thanks.
>
> What's the point of this patch set?
> My understanding is that if there are some (COUNT_MASK) corrected DRAM
> ECC errors for a specific page frame, we can believe that the page frame
> is so ill that it should be isolated as soon as possible.
>
> The problem is that memory_failure() cannot be used to isolate a page
> frame that is in use by the kernel, because it just poisons the page and
> returns IGNORED. memory_failure() is mostly used for handling AR/AO type
> errors on page frames that userspace tasks are currently using.
>
> Although the relevant page frame is very ill, it is not dead and can
> still work. However, memory_failure() may kill the userspace tasks,
> especially for those page frames that are holding dynamic data rather
> than file-backed (file/swap) data.
>
> So I do not think that it is a good idea to directly use memory_failure
> in this patch set.
>

I second that. You can't poison a page and potentially kill an
application just because an arbitrarily chosen number of corrected
errors has been exceeded. That would be an anti-RAS feature: less
reliability and availability.

A possible alternative would be to soft-offline the page. This is
currently done in APEI code when corrected memory error thresholds are
exceeded and reported by UEFI via a generic hardware error source
(GHES).

The example is in ghes_handle_memory_failure(), where we call
memory_failure_queue(pfn, 0, flags) with flags = MF_SOFT_OFFLINE.
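
The soft-offline idea can be illustrated with a small userspace sketch.
This is not the kernel code: every name prefixed with sketch_ or SKETCH_
is a hypothetical stand-in, the flag value is illustrative, and the
threshold is an arbitrary number picked for the example. In the GHES
path described above the threshold decision comes from firmware, so the
local counter here is only an assumption made to keep the example
self-contained. The stub just records the request that would be queued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; values are illustrative, not the kernel's. */
#define SKETCH_MF_SOFT_OFFLINE (1 << 3)
#define SKETCH_CE_THRESHOLD    4   /* arbitrary example threshold */

enum sketch_action { SKETCH_NONE, SKETCH_SOFT_OFFLINE };

/* Per-page bookkeeping for corrected errors (assumed layout). */
struct sketch_page {
    unsigned long pfn;
    unsigned int ce_count;
    int offline_requested;
};

/* Records what the stub below was asked to queue. */
struct sketch_request {
    unsigned long pfn;
    int trapno;
    int flags;
    int count;
};

struct sketch_request sketch_last;

/* Stub modelled on the call shape memory_failure_queue(pfn, 0, flags). */
void sketch_memory_failure_queue(unsigned long pfn, int trapno, int flags)
{
    sketch_last.pfn = pfn;
    sketch_last.trapno = trapno;
    sketch_last.flags = flags;
    sketch_last.count++;
}

/*
 * Account one corrected error against a page. Once the threshold is
 * reached, request a soft offline exactly once; the page is never
 * hard-poisoned, so nothing using it has to be killed.
 */
enum sketch_action sketch_report_corrected_error(struct sketch_page *page)
{
    assert(page != NULL);

    if (page->offline_requested)
        return SKETCH_NONE;

    if (++page->ce_count < SKETCH_CE_THRESHOLD)
        return SKETCH_NONE;

    page->offline_requested = 1;
    sketch_memory_failure_queue(page->pfn, 0, SKETCH_MF_SOFT_OFFLINE);
    return SKETCH_SOFT_OFFLINE;
}
```

The point of the sketch is the flag: crossing a corrected-error
threshold leads to a soft-offline request rather than a hard poison of
the page.
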
- Max