Message-ID: <4984489C.8020309@buttersideup.com>
Date:	Sat, 31 Jan 2009 12:48:28 +0000
From:	Tim Small <tim@...tersideup.com>
To:	"Eric W. Biederman" <ebiederm@...ssion.com>
CC:	Doug Thompson <norsk5@...oo.com>, ncunningham-lkml@...a.org.au,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	Chris Friesen <cfriesen@...tel.com>,
	Pavel Machek <pavel@...e.cz>,
	bluesmoke-devel@...ts.sourceforge.net,
	Arjan van de Ven <arjan@...radead.org>
Subject: Re: marching through all physical memory in software

Eric W. Biederman wrote:
> At the point we are talking about software scrubbing it makes sense to assume
> a least common denominator memory controller, one that does not do automatic
> write-back of the corrected value, as all of the recent memory controllers
> do scrubbing in hardware.
>   

I was just trying to clarify the distinction between the two processes 
which have similar names, but aren't (IMO) actually that similar:

"Software Scrubbing"

Triggering a read, and subsequent rewrite, of a particular RAM location 
which has suffered a correctable ECC error: the hardware detects and 
corrects the error on the read path, and the OS then rewrites the 
location to "scrub" the error in the case where the hardware doesn't do 
the write-back automatically.

This should be a very occasional error-path process, and performance is 
probably not critical.
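
To make that concrete, here's a rough sketch (userspace-flavoured C, not 
the actual in-kernel code; the base address and length are placeholders 
that would come from the corrected-error report).  Reading each word and 
writing the same value straight back, atomically, is enough to get the 
corrected data stored again without racing against a concurrent writer:

/*
 * Rough sketch only: rewrite an ECC-protected region in place.  Each
 * word is read and the same value written back via an atomic add of
 * zero (a locked read-modify-write on x86, here using the GCC __sync
 * builtin), so a concurrent writer can't be clobbered.
 */
#include <stdint.h>
#include <stddef.h>

static void scrub_region(volatile uint32_t *base, size_t bytes)
{
	size_t i;

	for (i = 0; i < bytes / sizeof(*base); i++)
		__sync_fetch_and_add(&base[i], 0);	/* read + write-back */
}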


"Background Scrubbing"

. This is a poor name, IMO ("scrub" implies some kind of write to me). 
It applies to a process whereby the ECC check-bits are verified 
periodically for the whole of physical RAM, so that single-bit errors in 
a given ECC block don't accumulate and turn into uncorrectable errors.  
It may also lead to improved data collection for some failure modes.  
Again, many memory controllers implement this feature in hardware, so we 
shouldn't do it twice where that support exists.

There is (AFAIK) no need to do any writes here, and in fact doing so is 
only likely to hurt performance.  The design which springs to mind is a 
background thread which (possibly at idle priority) reads RAM at a 
user-configurable rate (e.g. consume a maximum of n% of memory 
bandwidth, or read all of RAM at least once every x minutes); a rough 
sketch follows the list below.  Possible design issues:

. There will be some trade-off between reducing the impact on the 
system as a whole and making firm guarantees about how often memory is 
checked.  It's difficult to know what the default should be, but no firm 
guarantee of a minimum interval (idle processing only) is probably the 
option least likely to cause problems for most users.
. An eye will need to be kept on the impact that this reading has on 
the performance of the rest of the system (e.g. cache pollution and NUMA 
effects, as you previously mentioned), but my gut feeling is that for 
the majority of systems it shouldn't be significant.  If practical 
mechanisms are available on some CPUs to read RAM without populating the 
CPU cache, we should use those (but I've no idea whether they exist).
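
For what it's worth, the rate-limiting loop I have in mind looks 
something like this (rough userspace-flavoured sketch over an ordinary 
buffer, with an assumed chunk size; a real in-kernel version would walk 
physical memory from a low-priority thread instead):

/*
 * Sketch of the rate limiting only: read a chunk, then sleep off that
 * chunk's share of the time budget, so the average read rate stays at
 * or below target_bytes_per_sec.
 */
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stddef.h>
#include <time.h>

#define CHUNK (1UL << 20)		/* 1 MiB read per burst (assumed) */

static volatile uint64_t sink;		/* stop the reads being optimised away */

static void scrub_pass(const uint64_t *mem, size_t bytes,
		       unsigned long target_bytes_per_sec)
{
	size_t off;

	for (off = 0; off + CHUNK <= bytes; off += CHUNK) {
		const uint64_t *p = mem + off / sizeof(*mem);
		uint64_t acc = 0;
		uint64_t ns;
		struct timespec ts;
		size_t i;

		for (i = 0; i < CHUNK / sizeof(*p); i++)
			acc += p[i];	/* the ECC check happens on the read */
		sink = acc;

		/*
		 * Sleep off the whole time budget for this chunk; the read
		 * itself also took some time, so the real rate ends up a
		 * little below the target, which errs on the safe side.
		 */
		ns = (uint64_t)CHUNK * 1000000000ULL / target_bytes_per_sec;
		ts.tv_sec = ns / 1000000000ULL;
		ts.tv_nsec = ns % 1000000000ULL;
		nanosleep(&ts, NULL);
	}
}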

Perhaps a good default would be to benchmark memory read bandwidth when 
the feature is turned on, and then operate at (e.g.) 0.5% of that bandwidth.
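
Back-of-envelope (all numbers assumed, not measured): 0.5% of a 
benchmarked read bandwidth of ~6 GB/s would be ~30 MB/s, which gets 
through a 4 GB machine roughly every couple of minutes:

#include <stdio.h>

int main(void)
{
	double bench_bytes_per_sec = 6e9;	/* assumed benchmark result */
	double fraction = 0.005;		/* 0.5% of read bandwidth   */
	double ram_bytes = 4e9;			/* assumed amount of RAM    */
	double rate = bench_bytes_per_sec * fraction;

	printf("scrub at %.0f MB/s, full pass every %.0f seconds\n",
	       rate / 1e6, ram_bytes / rate);
	return 0;
}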


Cheers,

Tim.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
