Message-ID: <Yr61Jy6aGhxeulxN@zn.tnic>
Date:   Fri, 1 Jul 2022 10:49:43 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     "x86@...nel.org" <x86@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "patches@...ts.linux.dev" <patches@...ts.linux.dev>,
        Yazen Ghannam <yazen.ghannam@....com>
Subject: Re: [PATCH] RAS/CEC: Reduce default threshold to offline a page to
 "2"

On Thu, Jun 30, 2022 at 10:02:36AM -0700, Luck, Tony wrote:
> Yes. The cost to offline a page is low (4KB reduction in system capacity
> on a system with 10's or 100's of GB memory).

*If* that page is going to go bad at all.

> The risk to the system if the page does develop an uncorrected error is
> high (process is killed, or system crashes).

That's not what the papers say.

> The question is whether the default threshold should be "do I feel
> lucky?" and those corrected errors are nothing to worry about. Or
> "do I want to take the safe path?" and preemptively offline pages
> at the first sign of trouble.

Well, we can't decide that for every possible situation so if Intel's
recommendation is to do that on Intel systems, then users can set that.

/sys/kernel/debug/ras/cec/action_threshold is perhaps not the perfect
interface for that but we can make something more user-friendly.
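
As a rough illustration of what setting it from userspace could look like,
here is a minimal Python sketch. It assumes debugfs is mounted at
/sys/kernel/debug, the kernel is built with RAS_CEC, the caller is root, and
the file reads back as a plain decimal number. The helper names
read_threshold() and write_threshold() are made up for this example, and the
"at least 1" check is a guess at a sane lower bound, not something the kernel
is known to enforce.

```python
from pathlib import Path

# Path as quoted in this thread. Assumes debugfs is mounted at
# /sys/kernel/debug and the kernel has RAS_CEC enabled.
CEC_THRESHOLD = Path("/sys/kernel/debug/ras/cec/action_threshold")


def read_threshold(path=CEC_THRESHOLD):
    """Return the current threshold, assuming a plain decimal file."""
    return int(Path(path).read_text().strip())


def write_threshold(value, path=CEC_THRESHOLD):
    """Write a new threshold (hypothetical helper, needs root on a real system).

    The lower bound of 1 is an assumption made for this sketch.
    """
    if not isinstance(value, int) or value < 1:
        raise ValueError(f"threshold must be a positive integer, got {value!r}")
    Path(path).write_text(f"{value}\n")
```

A site that wants the proposed behaviour would call write_threshold(2) early
at boot; everyone else keeps whatever the kernel default is.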

> Is there a study about "wobbly" DIMMs?

Most of the papers I looked at say that the majority of errors are CEs
and that there's a likelihood that those errors can turn into UEs, but
none of them quantifies that likelihood. One paper says that a huge number
of the errors are transient. If you offline such a page just because two
alpha particles flew through it, you're offlining a perfectly good page.

The DRAM vendor is also important, as different DRAM vendors show different
error stats. And so on and so on.

So you can't simply go and decide for all and say, the answer is 2.

> We now have some real data. Instead of a "finger in the air guess"
> that was made (on a different generation of DIMM technology ... the
> AMD paper you reference below says DDR4 is 5.5x worse than DDR3).

In the next sentence it says that the hardware handles those errors just
fine!

> Second most common on DDR4 DIMMs is "row failure". Which current ECC
> systems don't handle well.

This is not what we're talking about here - we're talking about
offlining pages after 2 CEs.

As to the row offlining - yes, no question there, we need to address
that.

> While that's low from one perspective (0.6% servers affected) it's high
> enough to be interesting to the CSP - because they lose revenue and
> reputation when they have to tell their customers: "sorry the VM you
> rented from us just crashed". Note that one physical system crashing
> may take down dozens of VMs.

So that whitepaper doesn't specify what they call a "fault". I'm asking
because one of the papers in its Reference section explains the
terminology:

"A fault is the underlying cause of an error, such as a stuck-at bit or
high-energy particle strike. Faults can be active (causing errors), or
dormant (not causing errors).

An error is an incorrect portion of state resulting from an active
fault, such as an incorrect value in memory. Errors may be detected and
possibly corrected by higher level mechanisms such as parity or error
correcting codes (ECC). They may also go uncorrected, or in the worst
case, completely undetected (i.e., silent)."

So even if we put on the most pessimistic glasses and say that 0.6%
of the faults result in system crashes, a CSP can go and set the
threshold to something lower for their use case, following the
recommendations of their DRAM and CPU vendors and so on.

> While anyone can tune the RAS_CEC threshold. The default value should
> be something reasonable. I'm sticking with "2" being much more
> reasonable default than 1023.

You can make that configurable or Intel-only or whatever - but not
unconditional for everyone.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
