Date:	Sat, 19 Jul 2008 17:06:49 +0200
From:	Andi Kleen <andi@...stfloor.org>
To:	Matthew Wilcox <matthew@....cx>
Cc:	Russ Anderson <rja@....com>, mingo@...e.hu, tglx@...utronix.de,
	Tony Luck <tony.luck@...el.com>, linux-kernel@...r.kernel.org,
	linux-ia64@...r.kernel.org
Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

Matthew Wilcox <matthew@....cx> writes:

> On Sat, Jul 19, 2008 at 12:37:11PM +0200, Andi Kleen wrote:
>> Russ Anderson <rja@....com> writes:
>> 
>> > [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)
>> 
>> FWIW I discussed this with some hardware people and the general
>> opinion was that it was way too aggressive to disable a page on the
>> first corrected error like this patchkit currently does.  
>
> I think it's reasonable to take a page out of service on the first error.
> Then a user program needs to be notified of which bit is suspected.
> It can then subject that page to an intense set of tests (I'd start
> by stealing the ones from memtest86+) and if no more errors are found,
> it could return the page to service.

That would only help if just parts of that specific page are
corrupted.  But my understanding is that DIMM failures usually
cluster in larger units (channels, DIMMs, memory chips on them, banks
inside the chips, etc.), all far larger than a 4K page.

So to carry out your proposal you would need to work in units of whole
DIMMs, or at least their pages; otherwise it is somewhat
pointless. Since memory systems typically interleave, this would
likely need to be done across multiple DIMMs, potentially covering a
large memory area.
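
To illustrate the interleaving point, here is a toy model. It is only a
sketch, not how any particular memory controller maps addresses: the
64-byte interleave granularity, the four channels, and the helper names
channel_of and pages_touching_channel are all assumptions made up for
illustration.

```python
PAGE_SIZE = 4096      # 4K page, as discussed above
CACHE_LINE = 64       # ASSUMED interleave granularity (hypothetical)
NUM_CHANNELS = 4      # ASSUMED number of interleaved channels/DIMMs (hypothetical)


def channel_of(phys_addr: int) -> int:
    """Channel serving this address under simple cache-line interleaving."""
    return (phys_addr // CACHE_LINE) % NUM_CHANNELS


def pages_touching_channel(region_bytes: int, bad_channel: int) -> int:
    """Count 4K pages in [0, region_bytes) with at least one line on bad_channel."""
    count = 0
    for page_start in range(0, region_bytes, PAGE_SIZE):
        # Walk the page one cache line at a time and check its channel.
        lines = range(page_start, page_start + PAGE_SIZE, CACHE_LINE)
        if any(channel_of(addr) == bad_channel for addr in lines):
            count += 1
    return count


region = 1 << 20  # 1 MiB toy region
total_pages = region // PAGE_SIZE
affected = pages_touching_channel(region, bad_channel=2)
print(f"{affected} of {total_pages} pages have data on the suspect channel")
```

In this model every page in the region has lines on the suspect
channel, so taking one 4K page out of service does not isolate the
failing unit.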

In the end you'll end up with most of the mess of memory hot unplug,
because the more memory is affected, the more likely it is that
some unmovable kernel data is affected.

-Andi
