Message-ID: <87prpa88iw.fsf@basil.nowhere.org>
Date:	Sat, 19 Jul 2008 12:37:11 +0200
From:	Andi Kleen <andi@...stfloor.org>
To:	Russ Anderson <rja@....com>
Cc:	mingo@...e.hu, tglx@...utronix.de, Tony Luck <tony.luck@...el.com>,
	linux-kernel@...r.kernel.org, linux-ia64@...r.kernel.org
Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

Russ Anderson <rja@....com> writes:

> [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

FWIW I discussed this with some hardware people and the general
opinion was that it was way too aggressive to disable a page on the
first corrected error like this patchkit currently does.  

The corrected bit error could be caused by a temporary condition,
e.g. in the DIMM link, and does not necessarily mean that part of the
DIMM is really going bad. Permanently disabling a page would only be
justified if you saw repeated corrected errors over a long time from
the same DIMM.

There are also some potential scenarios where being so aggressive
could hurt, e.g. if you have a low rate of corrected events
spread randomly all over your memory (e.g. with a flaky DIMM
connection), after a long enough uptime you could lose significant parts
of your memory even though the DIMM is actually still ok.
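
To put rough numbers on that scenario, here is a back-of-the-envelope
sketch. Everything in it is an assumption for illustration: the function
name, the event rate, the page size and the memory size are all made up,
and it assumes every corrected event hits a different page, so it gives
an upper bound rather than a measurement.

```python
def lost_memory_fraction(events_per_day, uptime_days, page_size, total_bytes):
    """Upper bound on the fraction of memory disabled by a
    disable-on-first-corrected-error policy.

    Assumes (hypothetically) that each corrected event lands on a
    distinct page and that pages are never re-enabled.
    """
    pages_disabled = events_per_day * uptime_days
    bytes_disabled = pages_disabled * page_size
    # Cannot lose more memory than the machine has.
    return min(1.0, bytes_disabled / total_bytes)


# Hypothetical inputs: one event per hour, one year of uptime,
# 16 KiB pages, 8 GiB of memory.
fraction = lost_memory_fraction(
    events_per_day=24,
    uptime_days=365,
    page_size=16 * 1024,
    total_bytes=8 * 2**30,
)
print(f"{fraction:.2%} of memory disabled")
```

How quickly this adds up depends entirely on the event rate, the page
size and the uptime; the point is only that the loss grows with uptime
and is never given back.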

The other issue is that if the DIMM is going bad, the affected area
is likely larger than just the lines making up this page. So you
would still risk uncorrected errors anyway, because disabling
the page would cover only a small subset of the affected area.

If you really wanted to do this you probably should hook it up
to mcelog's (or the IA64 equivalent) DIMM database and then
control it from user space with suitably large thresholds
and DIMM-specific knowledge. But it's unlikely that this can really
be done nicely in a way that is isolated from very specific
knowledge about the underlying memory configuration.
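
As a minimal sketch of what such a user-space threshold policy might
look like: count corrected errors per DIMM over a sliding time window
and only flag the DIMM once the count reaches a threshold. This is not
mcelog's interface; the class name, the DIMM labels and the threshold
and window values are hypothetical, and the hard part (mapping an
error address to a DIMM for a given memory configuration) is left out.

```python
from collections import defaultdict, deque


class DimmErrorTracker:
    """Hypothetical per-DIMM corrected-error counter with a sliding window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold      # errors needed before acting
        self.window = window_seconds    # how far back errors still count
        self.events = defaultdict(deque)  # DIMM label -> recent timestamps

    def record(self, dimm, timestamp):
        """Record one corrected error on `dimm` at `timestamp` (seconds).

        Returns True if the DIMM has reached the threshold within the
        window, i.e. the errors look persistent rather than transient.
        """
        recent = self.events[dimm]
        recent.append(timestamp)
        # Forget errors that are older than the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.threshold


# Hypothetical policy: act only after 3 corrected errors within 100 seconds.
tracker = DimmErrorTracker(threshold=3, window_seconds=100)
for dimm, when in [("DIMM0", 0), ("DIMM0", 10), ("DIMM0", 500),
                   ("DIMM0", 510), ("DIMM0", 520)]:
    if tracker.record(dimm, when):
        print(f"{dimm}: repeated corrected errors, consider offlining pages")
```

With this kind of policy a single corrected error, or a few spread far
apart in time, never triggers any action; only a sustained run from
the same DIMM does.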

-Andi