Message-ID: <20250730182137.18605ea1@foz.lan>
Date: Wed, 30 Jul 2025 18:21:37 +0200
From: Mauro Carvalho Chehab <mchehab+huawei@...nel.org>
To: Breno Leitao <leitao@...ian.org>
Cc: Shuai Xue <xueshuai@...ux.alibaba.com>, Tony Luck <tony.luck@...el.com>,
 Borislav Petkov <bp@...en8.de>, "Rafael J. Wysocki" <rafael@...nel.org>,
 Len Brown <lenb@...nel.org>, James Morse <james.morse@....com>, Robert
 Moore <robert.moore@...el.com>, Thomas Gleixner <tglx@...utronix.de>, Ingo
 Molnar <mingo@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>,
 x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>, Hanjun Guo
 <guohanjun@...wei.com>, Mauro Carvalho Chehab <mchehab@...nel.org>, Mahesh
 J Salgaonkar <mahesh@...ux.ibm.com>, Oliver O'Halloran <oohall@...il.com>,
 Bjorn Helgaas <bhelgaas@...gle.com>, linux-acpi@...r.kernel.org,
 linux-kernel@...r.kernel.org, acpica-devel@...ts.linux.dev,
 osandov@...ndov.com, konrad.wilk@...cle.com, linux-edac@...r.kernel.org,
 linuxppc-dev@...ts.ozlabs.org, linux-pci@...r.kernel.org,
 kernel-team@...a.com
Subject: Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware
 errors

On Wed, 30 Jul 2025 06:11:52 -0700
Breno Leitao <leitao@...ian.org> wrote:

> Hello Shuai,
> 
> On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
> > In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
> > CPER_SEV_RECOVERABLE errors:  
> 
> Thanks. I was reading this code a bit more, and I want to make sure my
> understanding is correct, given I was confused about CORRECTED and
> RECOVERABLE errors.
> 
> CPER_SEV_CORRECTED means it is corrected in the background, and the OS
> was not even notified about it. That includes 1-bit ECC error.
> Those are not the errors we are interested in, since they are irrelevant
> to the OS.

Hardware-corrected errors aren't irrelevant. The rasdaemon utilities capture
such errors, as they may be a symptom of a hardware defect. As a matter
of fact, in rasdaemon, thresholds can be set to trigger an action, such
as disabling memory blocks that contain defective memory.

This is especially relevant for HPC and supercomputer workloads, where
it is a lot cheaper to disable a block of bad memory than to lose
an entire job, which could take several weeks of run time on a
supercomputer, just because defective memory ended up causing a
failure in the application.
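
To make the threshold idea concrete, here is a minimal sketch of counting
corrected errors per memory block in a sliding time window and flagging a
block once a limit is reached. This is not rasdaemon's actual code: the
class name, the block identifiers, and the threshold/window values are all
hypothetical, and the "action" is reduced to adding the block to a set.

```python
from collections import defaultdict, deque


class CorrectedErrorTracker:
    """Hypothetical sketch: per-block corrected-error counting with a threshold."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold            # errors needed to flag a block
        self.window_seconds = window_seconds  # how far back errors are counted
        self._events = defaultdict(deque)     # block id -> recent timestamps
        self.flagged = set()                  # blocks that crossed the threshold

    def record(self, block, timestamp):
        """Record one corrected error; return True if the block was just flagged."""
        if block in self.flagged:
            return False  # already handled, nothing more to do

        events = self._events[block]
        events.append(timestamp)

        # Drop errors that fell out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()

        if len(events) >= self.threshold:
            # A real tool would trigger its configured action here
            # (for instance, disabling the memory block).
            self.flagged.add(block)
            del self._events[block]
            return True
        return False


tracker = CorrectedErrorTracker(threshold=3, window_seconds=60)
burst = [tracker.record("block0", t) for t in (0, 10, 20)]     # errors close together
sparse = [tracker.record("block1", t) for t in (0, 100, 200)]  # errors far apart
```

With these assumed values, three errors within the same minute flag
"block0", while the same number of errors spread over a longer period on
"block1" never reach the threshold.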

Regards,
Mauro
