linux-kernel - Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

Message-ID: <20170322180339.GC15888@nazgul.tnic>
Date:   Wed, 22 Mar 2017 19:03:39 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     X86 ML <x86@...nel.org>, linux-edac <linux-edac@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

On Mon, Mar 20, 2017 at 03:48:24PM -0700, Luck, Tony wrote:
> You added "count_threshold" for me ... so the condition isn't quite "overflows"
> like it was in the early versions.

It is a max count which, when reached, causes the soft offline attempt.
What did you mean with "overflows" exactly then?

> We may need to give some thought on what to do if the attempt to offline
> the page fails (e.g. because the page belongs to the kernel). Right now
> you delete it from the list, but we will see more errors as the page is
> still in use. Eventually the counter will hit count_threshold and we will
> try to offline again. Rinse, repeat.

Well, what *is* there we can do? If the offlining code can't offline
it, there's not a whole lot we *can* do. The error would keep repeating
as a corrected error, rinse, repeat, and we will keep trying to offline
the containing page.

That is, until it degrades to an uncorrectable error and then we're
dead.

Either way, the collector can't really do anything about it. This would
be beyond its functionality anyway.

IMO.
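
To make the retry cycle discussed above concrete, here is a minimal toy
model of it. It is only a sketch, not code from the patch: the function
name, the threshold value and the "delete the entry either way" behaviour
are assumptions taken from the description in this thread.

```python
# Toy model of the "rinse, repeat" cycle for a single page frame.
# COUNT_THRESHOLD is a hypothetical value, not the kernel default.
COUNT_THRESHOLD = 4


def simulate(num_errors, can_offline):
    """Feed num_errors corrected errors for one pfn.

    Returns how many soft-offline attempts were made. can_offline says
    whether the (hypothetical) offlining step succeeds for this page.
    """
    count = 0
    attempts = 0
    for _ in range(num_errors):
        count += 1
        if count >= COUNT_THRESHOLD:
            attempts += 1
            # The entry is dropped from the collector either way,
            # so counting starts from scratch afterwards.
            count = 0
            if can_offline:
                # Page is gone: no further errors come from it.
                break
    return attempts
```

With these assumptions, a page that cannot be offlined (for example, one
owned by the kernel) triggers a new attempt every COUNT_THRESHOLD errors,
while an offlinable page triggers exactly one.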

> Someone also recently sent me a log from a machine with corrected errors
> in over 9000 unique addresses. Need a parameter to allocate more than one
> page for the collector, or a way to grow the space.

Well, so even with the number of unique addresses higher than the number
of CEC slots, we should be able to deal with them ok: the moment we enter
more than CLEAN_ELEMS pfns, we will trigger a spring cleaning which will
degrade the already logged errors. Once the array is filled up, we will
replace the LRU pfn with the new one.

And so on.

And this way it would fulfill its purpose of *not* generating error
records into the decoding chain after it. If one of those 9000 errors
overflows, we will try to offline the page.
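
The behaviour described above can be sketched with a small toy model. This
is a simplification under stated assumptions, not the patch's code: the
class name, the dict-based storage, the max_decay field and the exact
trigger condition for the spring cleaning are all hypothetical; only the
names CLEAN_ELEMS and count_threshold come from this thread.

```python
class ToyCollector:
    """Simplified model of a fixed-size corrected-errors collector."""

    def __init__(self, slots, clean_elems, count_threshold, max_decay=3):
        self.slots = slots                      # capacity of the array
        self.clean_elems = clean_elems          # inserts between cleanings
        self.count_threshold = count_threshold  # errors before offlining
        self.max_decay = max_decay              # hypothetical "freshness"
        self.entries = {}                       # pfn -> [decay, count]
        self.inserted_since_clean = 0
        self.offlined = []                      # pfns we tried to offline

    def _spring_clean(self):
        # Degrade every logged error so stale pfns become LRU candidates.
        for entry in self.entries.values():
            if entry[0] > 0:
                entry[0] -= 1
        self.inserted_since_clean = 0

    def log_error(self, pfn):
        entry = self.entries.get(pfn)
        if entry is None:
            if len(self.entries) >= self.slots:
                # Array is full: replace the least recently used pfn,
                # modelled here as the one with the lowest decay value.
                lru = min(self.entries, key=lambda p: self.entries[p][0])
                del self.entries[lru]
            entry = self.entries[pfn] = [self.max_decay, 0]
            self.inserted_since_clean += 1
        entry[0] = self.max_decay   # a fresh error refreshes the entry
        entry[1] += 1
        if entry[1] >= self.count_threshold:
            # Threshold reached: drop the entry and try to offline the page.
            del self.entries[pfn]
            self.offlined.append(pfn)
        elif self.inserted_since_clean >= self.clean_elems:
            self._spring_clean()
```

In this model, feeding 9000 unique pfns into a collector with far fewer
slots never grows the table past its capacity and never triggers an
offline attempt; only a pfn that keeps recurring reaches the threshold.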

Either way we work as advertised.

Lemme try to write a small script exercising exactly that scenario to
see whether I'm actually not talking crap here :-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.