linux-kernel - Re: [PATCH] ghes: Track number of recovered hardware errors

Message-ID: <58f3242a-e52a-46a9-9a99-3887eeaa1285@linux.alibaba.com>
Date: Thu, 17 Jul 2025 11:03:51 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Breno Leitao <leitao@...ian.org>
Cc: Borislav Petkov <bp@...en8.de>, Alexander Graf <graf@...zon.com>,
 Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
 Peter Gonda <pgonda@...gle.com>, "Luck, Tony" <tony.luck@...el.com>,
 "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
 James Morse <james.morse@....com>, "Moore, Robert" <robert.moore@...el.com>,
 "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "acpica-devel@...ts.linux.dev" <acpica-devel@...ts.linux.dev>,
 "kernel-team@...a.com" <kernel-team@...a.com>
Subject: Re: [PATCH] ghes: Track number of recovered hardware errors

On 2025/7/16 20:42, Breno Leitao wrote:
> hello Shuai,
> 
> On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:
>>> My plan with this patch is to have a counter for hardware errors that
>>> would be exposed to the crashdump. So, post-mortem analysis tooling can
>>> easily query if there are hardware errors and query RAS information in
>>> the right databases, in case it looks like a smoking gun.
>>
>> I see your point. But does using a single ghes_recovered_errors counter
>> to track all corrected and non-fatal errors for CPU, memory, and PCIe
>> really help?
> 
> It provides a quick indication that hardware issues have occurred, which
> can prompt the operator to investigate further via RAS events.
> 
> That said, Tony proposed a more robust approach—categorizing and
> tracking errors by their source. This would involve maintaining separate
> counters for each source, using one counter per enum value:
> 
> 	enum recovered_error_sources {
> 		ERR_GHES,
> 		ERR_MCE,
> 		ERR_AER,
> 		...
> 		ERR_NUM_SOURCES
> 	};
> 
> See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/
> 
> Do you think this would help you by any chance?
> 
> Thanks
> --breno

Personally, I think this approach would be more helpful. Additionally, I
suggest not mixing CEs (Correctable Errors) and UEs (Uncorrectable
Errors) together. This is especially important for memory errors, as CEs
occur much more frequently than UEs, but their impact is much smaller.
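
To illustrate, here is a rough userspace sketch of per-source counters
with CEs and UEs kept apart. It is not taken from any patch: the
severity enum and the helper names are hypothetical, the "..." entries
of the quoted enum are left out, and C11 atomics stand in for whatever
counter type the kernel code would actually use.

```c
#include <assert.h>
#include <stdatomic.h>

/* Error sources, as in the enum quoted above (elided entries omitted). */
enum recovered_error_sources {
	ERR_GHES,
	ERR_MCE,
	ERR_AER,
	ERR_NUM_SOURCES
};

/* Hypothetical severity split so that CEs and UEs are never mixed. */
enum recovered_error_severity {
	ERR_SEV_CE,	/* correctable error */
	ERR_SEV_UE,	/* uncorrectable, but recovered, error */
	ERR_NUM_SEVERITIES
};

/* One counter per (source, severity) pair; static storage starts at zero. */
static atomic_uint recovered_errors[ERR_NUM_SOURCES][ERR_NUM_SEVERITIES];

/* Hypothetical helper: account one recovered error. */
static void recovered_error_inc(enum recovered_error_sources src,
				enum recovered_error_severity sev)
{
	assert(src < ERR_NUM_SOURCES && sev < ERR_NUM_SEVERITIES);
	/* Relaxed ordering is enough for a statistics counter. */
	atomic_fetch_add_explicit(&recovered_errors[src][sev], 1,
				  memory_order_relaxed);
}

/* Hypothetical helper: read one counter, e.g. from post-mortem tooling. */
static unsigned int recovered_error_read(enum recovered_error_sources src,
					 enum recovered_error_severity sev)
{
	assert(src < ERR_NUM_SOURCES && sev < ERR_NUM_SEVERITIES);
	return atomic_load_explicit(&recovered_errors[src][sev],
				    memory_order_relaxed);
}
```

With this layout, a burst of memory CEs reported through GHES only
moves recovered_errors[ERR_GHES][ERR_SEV_CE], so a single UE from the
same source is still visible on its own.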

Thanks.
Shuai