Message-ID: <20230414102401.GAZDkpwUHfFM64dpIK@fat_crate.local>
Date:   Fri, 14 Apr 2023 12:24:01 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     Paul Menzel <pmenzel@...gen.mpg.de>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        LKML <linux-kernel@...r.kernel.org>,
        Yazen Ghannam <yazen.ghannam@....com>
Subject: Re: AMD EPYC 25 (19h): Hardware Error: Machine Check: 0 Bank 17:
 d42040000000011b

On Fri, Apr 14, 2023 at 11:26:27AM +0200, Paul Menzel wrote:
> It says “no action required”,

Yes, it means you had a single bit flip in some DIMM and it got
corrected by the ECC so you don't need to do anything.
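As a rough sketch of why this reads as a corrected error, the status value from the subject line can be split into its flag bits. The bit positions below are my understanding of the MCA status layout as the kernel defines it for AMD (an assumption; check them against asm/mce.h in your tree and the processor documentation), and decode_mca_status() is just a hypothetical helper name:

```python
# Bit positions in MCA_STATUS, as I understand them from the kernel's
# definitions for AMD systems (assumption: verify against your kernel tree).
STATUS_BITS = {
    "VAL": 63,       # register contents are valid
    "OVER": 62,      # an earlier error was overwritten
    "UC": 61,        # uncorrected error
    "EN": 60,        # error reporting enabled
    "MISCV": 59,     # MISC register valid
    "ADDRV": 58,     # ADDR register valid
    "PCC": 57,       # processor context corrupt
    "SyndV": 53,     # syndrome register valid
    "CECC": 46,      # corrected by ECC
    "UECC": 45,      # uncorrectable ECC error
    "Deferred": 44,  # deferred error
    "Poison": 43,    # poisoned data consumed
}


def decode_mca_status(status):
    """Map each flag name to True/False for the given 64-bit status value."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}


status = 0xd42040000000011b  # value from the subject line
flags = decode_mca_status(status)
set_flags = [name for name, is_set in flags.items() if is_set]
error_code = status & 0xffff  # low 16 bits hold the error code

print("flags set:", ", ".join(set_flags))
print("error code: 0x%04x" % error_code)
```

With those assumed bit positions, CECC is set while UC and PCC are clear, which matches the "corrected, no action required" reading.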

> but out of the identical 14 servers with the same workload this is the
> only one having shown this errors three times.

Or you could enable CONFIG_RAS_CEC and not see those errors anymore.

It all depends: a DIMM could be producing correctable errors for a long
time before going bad. If ever. If you don't want to risk whatever
you're running on that machine by a DIMM *potentially* going bad, sure,
you can replace it. That's a budget call. :)

> Maybe the DIMM at bank 17 should just be replaced.

Bank 17 is the CPU MCA bank which reports the error - not a DIMM bank.
In order to pinpoint the location, you should have amd64_edac loaded so
that it can decode the error to a specific DIMM. You could try loading
that module and injecting all the errors you have to see what it says;
that should work too, as the error signature has everything needed for
decoding, AFAICT.

But Yazen can chime in here if I'm off.
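
A minimal sketch for checking whether amd64_edac is loaded and what the per-DIMM corrected error counters say, assuming the usual EDAC sysfs layout under /sys/devices/system/edac/mc with dimm_label and dimm_ce_count files. The helper names are made up for illustration, and a driver built into the kernel will not show up in /proc/modules:

```python
import glob
import os


def read_text(path):
    """Return the stripped contents of a file, or "" if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def loaded_modules(text):
    """Return module names from /proc/modules-style text (first field per line)."""
    return {line.split()[0] for line in text.splitlines() if line.strip()}


def per_dimm_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Map 'mcN/dimmM' to (label, corrected error count) for each DIMM found."""
    counts = {}
    for dimm_dir in sorted(glob.glob(os.path.join(edac_root, "mc*", "dimm*"))):
        ce = read_text(os.path.join(dimm_dir, "dimm_ce_count"))
        if not ce.isdigit():
            continue  # counter missing or unreadable, skip this entry
        label = read_text(os.path.join(dimm_dir, "dimm_label"))
        key = os.path.relpath(dimm_dir, edac_root)
        counts[key] = (label, int(ce))
    return counts


mods = loaded_modules(read_text("/proc/modules"))
print("amd64_edac loaded:", "amd64_edac" in mods)
for key, (label, ce) in per_dimm_ce_counts().items():
    print(key, label, "corrected errors:", ce)
```

If the module is loaded and one DIMM's counter stands out from the rest, that is the one to look at.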

HTH.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
