linux-kernel - Re: [RFC PATCH 0/3] RAS: Correctable Errors Collector thing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1401296015.4361.12.camel@oc3432500282.ibm.com>
Date:	Wed, 28 May 2014 09:53:35 -0700
From:	Max Asbock <masbock@...ux.vnet.ibm.com>
To:	Chen Yucong <slaoub@...il.com>
Cc:	Borislav Petkov <bp@...en8.de>,
	LKML <linux-kernel@...r.kernel.org>,
	linux-edac <linux-edac@...r.kernel.org>, X86 ML <x86@...nel.org>,
	Tony Luck <tony.luck@...el.com>
Subject: Re: [RFC PATCH 0/3] RAS: Correctable Errors Collector thing

On Wed, 2014-05-28 at 10:49 +0800, Chen Yucong wrote:
> > From: Borislav Petkov <bp@...e.de>
> > 
> > Hi all,
> > 
> > this is something Tony and I have been working on behind the curtains
> > recently. Here it is in a RFC form, it passes quick testing in kvm. Let
> > me send it out before I start hammering on it on a real machine.
> > 
> > More indepth info about what it is and what it does is in patch 1/3.
> > 
> > As always, comments and suggestions are most welcome.
> > 
> > Thanks.
> 
> What's the point of this patch set?
> My understanding is that if there are some(COUNT_MASK) corrected DRAM
> ECC errors for a specific page frame, we can believe that the page frame
> is so ill that it should be isolated as soon as possible.
> 
> The question is: memory_failure can not be used for isolating the page
> frame which is being used by kernel, because it just poison the page and
> IGNORED. memory_failure is mostly used for handling AR/AO type errors
> related to the page frame which the userspace tasks are using now.
> 
> Although the relative page frame is very ill, it is not dead and can
> still work. However, memory_failure may kill the userspace tasks,
> especially for those page frames that are holding dynamic data rather
> than file-backed(file/swap) data.
> 
> So I do not think that it is a good idea to directly use memory_failure
> in this patch set. 
> 

I second that. You can't poison a page and potentially kill an
application just because an arbitrarily chosen number of corrected
errors has been exceeded. That would be an anti-RAS feature: less
reliability and availability.
A possible alternative would be to soft-offline the page. This is
currently done in APEI code when corrected memory error thresholds are
exceeded and reported by UEFI via a generic hardware error source
(GHES). 
The example is in ghes_handle_memory_failure() where we call
memory_failure_queue(pfn, 0, flags) with flags = MF_SOFT_OFFLINE

- Max

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/