Message-ID: <20170322180339.GC15888@nazgul.tnic>
Date:   Wed, 22 Mar 2017 19:03:39 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     X86 ML <x86@...nel.org>, linux-edac <linux-edac@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

On Mon, Mar 20, 2017 at 03:48:24PM -0700, Luck, Tony wrote:
> You added "count_threshold" for me ... so the condition isn't quite "overflows"
> like it was in the early versions.

It is a max count which, when reached, triggers the soft-offline attempt.
What did you mean by "overflows" exactly, then?
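
Purely to illustrate the semantics I mean, a toy fragment rather than the
patch code (the names and the threshold value below are made up):

#include <stdbool.h>

/* Example value only, not necessarily the patch's default. */
static const unsigned int count_threshold = 10;

/*
 * Bump the per-pfn error count; returns true when the caller should
 * attempt to soft-offline the containing page and drop the element.
 */
static bool ce_account(unsigned int *count)
{
        return ++*count >= count_threshold;
}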

> We may need to give some thought on what to do if the attempt to offline
> the page fails (e.g. because the page belongs to the kernel). Right now
> you delete it from the list, but we will see more errors as the page is
> still in use. Eventually the counter will hit count_threshold and we will
> try to offline again. Rinse, repeat.

Well, what *is* there that we can do? If the offlining code can't offline
it, there's not a whole lot we *can* do. The error would keep repeating
as a corrected error, rinse, repeat, and we would keep trying to offline
the containing page.

That is, until it degrades to an uncorrectable error and then we're
dead.

Either way, the collector can't really do anything about it. This would
be beyond its functionality anyway.

IMO.

> Someone also recently sent me a log from a machine with corrected errors
> in over 9000 unique addresses. Need a parameter to allocate more than one
> page for the collector, or a way to grow the space.

Well, so even with the number of unique addresses higher than the number
of CEC slots, we should be able to deal with them OK: the moment we enter
more than CLEAN_ELEMS pfns, we will trigger a spring cleaning which will
degrade the already logged errors. Once the array is filled up, we will
replace the LRU pfn with the new one.

And so on.
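
As a toy model of that flow (made-up names and constants, userspace C
rather than the kernel code from the patch; the decay policy here is
just an example):

#include <stddef.h>

#define MAX_ELEMS       512     /* roughly a page worth of slots */
#define CLEAN_ELEMS     16      /* insertions between spring cleanings */

struct ce_elem {
        unsigned long long pfn;
        unsigned int count;
        unsigned long long last_seen;   /* stand-in for LRU ordering */
};

static struct ce_elem array[MAX_ELEMS];
static unsigned int n, since_clean;
static unsigned long long now;

/* Degrade all counters and drop the elements which decayed to zero. */
static void spring_cleaning(void)
{
        unsigned int i, j = 0;

        for (i = 0; i < n; i++) {
                array[i].count >>= 1;
                if (array[i].count)
                        array[j++] = array[i];
        }
        n = j;
        since_clean = 0;
}

/* Linear scan; finding an element refreshes its LRU stamp. */
static struct ce_elem *ce_lookup(unsigned long long pfn)
{
        unsigned int i;

        for (i = 0; i < n; i++) {
                if (array[i].pfn == pfn) {
                        array[i].last_seen = ++now;
                        return &array[i];
                }
        }
        return NULL;
}

/* Insert a new pfn, replacing the LRU element once the array is full. */
static struct ce_elem *ce_insert(unsigned long long pfn)
{
        unsigned int i, slot = 0;

        if (++since_clean > CLEAN_ELEMS)
                spring_cleaning();

        if (n < MAX_ELEMS) {
                slot = n++;
        } else {
                for (i = 1; i < n; i++)
                        if (array[i].last_seen < array[slot].last_seen)
                                slot = i;
        }

        array[slot].pfn       = pfn;
        array[slot].count     = 0;
        array[slot].last_seen = ++now;

        return &array[slot];
}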

And this way it would fulfill its purpose of *not* passing error records
on to the decoding chain after it. If one of those 9000 addresses reaches
count_threshold, we will try to offline its page.

Either way, we work as advertised.

Lemme try to write a small script exercising exactly that scenario to
see whether I'm actually not talking crap here :-)
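
Something along those lines maybe, as a quick userspace driver on top of
the toy fragments above (again made up, not a test against the real
collector):

#include <stdio.h>

int main(void)
{
        unsigned long long pfn, base = 0x100000;
        unsigned int offline_attempts = 0, rep;

        /* ~9000 unique addresses, each reported a few times. */
        for (pfn = base; pfn < base + 9000; pfn++) {
                for (rep = 0; rep < 3; rep++) {
                        struct ce_elem *e = ce_lookup(pfn);

                        if (!e)
                                e = ce_insert(pfn);
                        if (ce_account(&e->count))
                                offline_attempts++;
                }
        }

        /* One page which really is going bad: hammer it until it trips. */
        pfn = base + 42;
        for (rep = 0; rep < 2 * count_threshold; rep++) {
                struct ce_elem *e = ce_lookup(pfn);

                if (!e)
                        e = ce_insert(pfn);
                if (ce_account(&e->count)) {
                        offline_attempts++;
                        break;
                }
        }

        printf("slots in use: %u, offline attempts: %u\n", n, offline_attempts);
        return 0;
}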

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
