Message-ID: <58f3242a-e52a-46a9-9a99-3887eeaa1285@linux.alibaba.com>
Date: Thu, 17 Jul 2025 11:03:51 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Breno Leitao <leitao@...ian.org>
Cc: Borislav Petkov <bp@...en8.de>, Alexander Graf <graf@...zon.com>,
 Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
 Peter Gonda <pgonda@...gle.com>, "Luck, Tony" <tony.luck@...el.com>,
 "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
 James Morse <james.morse@....com>, "Moore, Robert" <robert.moore@...el.com>,
 "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "acpica-devel@...ts.linux.dev" <acpica-devel@...ts.linux.dev>,
 "kernel-team@...a.com" <kernel-team@...a.com>
Subject: Re: [PATCH] ghes: Track number of recovered hardware errors



On 2025/7/16 20:42, Breno Leitao wrote:
> hello Shuai,
> 
> On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:
>>> My plan with this patch is to have a counter for hardware errors that
>>> would be exposed to the crashdump. So, post-mortem analysis tooling can
>>> easily query if there are hardware errors and query RAS information in
>>> the right databases, in case it seems a smoking gun.
>>
>> I see your point. But does using a single ghes_recovered_errors counter
>> to track all corrected and non-fatal errors for CPU, memory, and PCIe
>> really help?
> 
> It provides a quick indication that hardware issues have occurred, which
> can prompt the operator to investigate further via RAS events.
> 
> That said, Tony proposed a more robust approach—categorizing and
> tracking errors by their source. This would involve maintaining separate
> counters for each source using a counter per enum type:
> 
> 	enum recovered_error_sources {
> 		ERR_GHES,
> 		ERR_MCE,
> 		ERR_AER,
> 		...
> 		ERR_NUM_SOURCES
> 	};
> 
> See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/
> 
> Do you think this would help you by any chance?
> 
> Thanks
> --breno


Personally, I think this approach would be more helpful. Additionally, I
suggest not mixing CEs (Correctable Errors) and UEs (Uncorrectable
Errors) together. This is especially important for memory errors, as CEs
occur much more frequently than UEs, but their impact is much smaller.
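
As a rough illustration of keeping CEs and UEs apart, the per-source
counters could be indexed by severity as well. This is only a userspace
sketch under stated assumptions: every name other than ERR_GHES, ERR_MCE,
ERR_AER and ERR_NUM_SOURCES is hypothetical, and real kernel code would
presumably use atomic_t helpers rather than C11 atomics.

```c
#include <stdatomic.h>

/* Error sources, following the enum quoted above (type name is hypothetical). */
enum recovered_error_source {
	ERR_GHES,
	ERR_MCE,
	ERR_AER,
	ERR_NUM_SOURCES
};

/* Hypothetical severity split: correctable vs. uncorrectable errors. */
enum recovered_error_severity {
	ERR_SEV_CE,
	ERR_SEV_UE,
	ERR_NUM_SEVERITIES
};

/* One counter per (source, severity) pair; static storage starts at zero. */
static atomic_uint recovered_errors[ERR_NUM_SOURCES][ERR_NUM_SEVERITIES];

/* Record one recovered error for the given source and severity. */
static void count_recovered_error(enum recovered_error_source src,
				  enum recovered_error_severity sev)
{
	atomic_fetch_add(&recovered_errors[src][sev], 1);
}

/* Read the current count for the given source and severity. */
static unsigned int read_recovered_errors(enum recovered_error_source src,
					  enum recovered_error_severity sev)
{
	return atomic_load(&recovered_errors[src][sev]);
}
```

With a layout like this, a burst of memory CEs would only raise the CE
counter of its source and would not hide a single UE from post-mortem
tooling.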

Thanks.
Shuai
