Message-ID: <20250916140744.GA1054485@yaz-khff2.amd.com>
Date: Tue, 16 Sep 2025 10:07:44 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: Bert Karwatzki <spasswolf@....de>, Tony Luck <tony.luck@...el.com>,
	linux-kernel@...r.kernel.org, linux-next@...r.kernel.org,
	linux-edac@...r.kernel.org, linux-acpi@...r.kernel.org,
	x86@...nel.org, rafael@...nel.org, qiuxu.zhuo@...el.com,
	nik.borisov@...e.com, Smita.KoralahalliChannabasappa@....com
Subject: Re: spurious mce Hardware Error messages in next-20250912

On Tue, Sep 16, 2025 at 11:10:55AM +0200, Borislav Petkov wrote:
> On Mon, Sep 15, 2025 at 11:43:26PM +0200, Bert Karwatzki wrote:
> > After re-cloning linux-next I tested next-20250911 and I get no mce error messages
> > even if I set the check_interval to 10.
> 
> Yazen, I've zapped everything from the handler unification onwards:
> 
> 28e82d6f03b0 x86/mce: Save and use APEI corrected threshold limit
> c8f4cea38959 x86/mce: Handle AMD threshold interrupt storms
> 5a92e88ffc49 x86/mce/amd: Define threshold restart function for banks
> 922300abd79d x86/mce/amd: Remove redundant reset_block()
> 9b92e18973ce x86/mce/amd: Support SMCA corrected error interrupt
> fe02d3d00b06 x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
> cf6f155e848b x86/mce: Unify AMD DFR handler with MCA Polling
> 53b3be0e79ef x86/mce: Unify AMD THR handler with MCA Polling
> 
> until this is properly sorted out, now this close to the merge window.
> 
> Thanks, Bert, for reporting!
> 

No problem, thanks Boris.

Bert, can you please try the following patch on next-20250912?

I expect that you will see the "debug" message, but the regular MCA
logging should be gone.

Also, we haven't been able to reproduce this issue yet. So thank you for
your help. It's much appreciated.

Thanks,
Yazen

From 6674a70f2369711aa28a20b88de7d89a9b6d03e0 Mon Sep 17 00:00:00 2001
From: Yazen Ghannam <yazen.ghannam@....com>
Date: Tue, 16 Sep 2025 09:44:24 -0400
Subject: [PATCH] x86/mce: Debug spurious MCA errors

The suspicion is that unexpected error information is present in the
MCA_DESTAT register.

Print the CPU, bank, and status value for debugging.

Check for the "Deferred" status bit to decide if the deferred error
registers should be logged. Only deferred errors should be logged in
these registers, so anything else should be ignored.

Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
---
 arch/x86/kernel/cpu/mce/core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6e48290a3844..741473ef7fdc 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -741,6 +741,11 @@ static bool smca_should_log_poll_error(enum mcp_flags flags, struct mce_hw_err *
 	if (!(m->status & MCI_STATUS_VAL))
 		return false;
 
+	pr_err("DEBUG: CPU%d Bank:%d Status:0x%016llx\n", m->extcpu, m->bank, m->status);
+
+	if (!(m->status & MCI_STATUS_DEFERRED))
+		return false;
+
 	m->kflags |= MCE_CHECK_DFR_REGS;
 	return true;
 }
-- 
2.51.0
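
For anyone checking the resulting "DEBUG:" lines by hand, here is a
minimal userspace sketch of the two status tests in the patch. It is not
kernel code. The bit positions are the ones the kernel defines in
arch/x86/include/asm/mce.h (Val is bit 63, Deferred is bit 44). The
helper names status_should_log_dfr() and parse_debug_line() are
hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same bit positions as the kernel's arch/x86/include/asm/mce.h. */
#define MCI_STATUS_VAL      (1ULL << 63)  /* status register is valid */
#define MCI_STATUS_DEFERRED (1ULL << 44)  /* AMD: deferred error */

/*
 * Hypothetical helper: mirrors only the two status tests from the
 * patch. Returns true when the value would be logged from the deferred
 * error registers, false when it would be ignored.
 */
static bool status_should_log_dfr(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return false;

	if (!(status & MCI_STATUS_DEFERRED))
		return false;

	return true;
}

/*
 * Hypothetical helper: pull CPU, bank and status out of one log line
 * in the format used by the pr_err() in the patch. Any prefix before
 * "DEBUG:" (timestamp, "mce: ") is skipped.
 */
static bool parse_debug_line(const char *line, int *cpu, int *bank,
			     uint64_t *status)
{
	unsigned long long st;
	const char *p = strstr(line, "DEBUG: CPU");

	if (!p)
		return false;

	if (sscanf(p, "DEBUG: CPU%d Bank:%d Status:0x%llx", cpu, bank, &st) != 3)
		return false;

	*status = st;
	return true;
}
```

Feeding each "DEBUG:" line from dmesg through parse_debug_line() and
then status_should_log_dfr() shows which entries the added check would
drop: anything with Val set but Deferred clear.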
