Message-ID: <9488e4bf935aa1e50179019419dfee93d306ded9.camel@web.de>
Date: Tue, 16 Sep 2025 22:27:35 +0200
From: Bert Karwatzki <spasswolf@....de>
To: Yazen Ghannam <yazen.ghannam@....com>, Borislav Petkov <bp@...en8.de>
Cc: Tony Luck <tony.luck@...el.com>, linux-kernel@...r.kernel.org, 
	linux-next@...r.kernel.org, linux-edac@...r.kernel.org, 
	linux-acpi@...r.kernel.org, x86@...nel.org, rafael@...nel.org, 
	qiuxu.zhuo@...el.com, nik.borisov@...e.com, 
	Smita.KoralahalliChannabasappa@....com, spasswolf@....de
Subject: Re: spurious mce Hardware Error messages in next-20250912

Am Dienstag, dem 16.09.2025 um 10:07 -0400 schrieb Yazen Ghannam:
> On Tue, Sep 16, 2025 at 11:10:55AM +0200, Borislav Petkov wrote:
> > On Mon, Sep 15, 2025 at 11:43:26PM +0200, Bert Karwatzki wrote:
> > > After re-cloning linux-next I tested next-20250911 and I get no mce error messages
> > > even if I set the check_interval to 10.
> > 
> > Yazen, I've zapped everything from the handler unification onwards:
> > 
> > 28e82d6f03b0 x86/mce: Save and use APEI corrected threshold limit
> > c8f4cea38959 x86/mce: Handle AMD threshold interrupt storms
> > 5a92e88ffc49 x86/mce/amd: Define threshold restart function for banks
> > 922300abd79d x86/mce/amd: Remove redundant reset_block()
> > 9b92e18973ce x86/mce/amd: Support SMCA corrected error interrupt
> > fe02d3d00b06 x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
> > cf6f155e848b x86/mce: Unify AMD DFR handler with MCA Polling
> > 53b3be0e79ef x86/mce: Unify AMD THR handler with MCA Polling
> > 
> > until this is properly sorted out, now this close to the merge window.
> > 
> > Thanks, Bert, for reporting!
> > 
> 
> No problem, thanks Boris.
> 
> Bert, can you please try the following patch on next-20250912?
> 
> I expect that you will see the "debug" message, but the regular MCA
> logging should be gone.
> 

I applied your patch on next-20250912; these are now the only messages
I get from mce:

[  333.337544] [      C0] mce: DEBUG: CPU0 Bank:11 Status:0x8700aa0800000000
[  333.337556] [      C0] mce: DEBUG: CPU0 Bank:14 Status:0x8724aa0800000000
[  661.017608] [      C0] mce: DEBUG: CPU0 Bank:11 Status:0x8424aa4800a9413b
[  661.017619] [      C0] mce: DEBUG: CPU0 Bank:14 Status:0x8700aa0800000000
[  988.697243] [      C0] mce: DEBUG: CPU0 Bank:11 Status:0x8700aa0800000000
[  988.697250] [      C0] mce: DEBUG: CPU0 Bank:14 Status:0x8724ab8800000000
[ 1316.377571] [      C0] mce: DEBUG: CPU0 Bank:11 Status:0x8700a28800000000
[ 1316.377582] [      C0] mce: DEBUG: CPU0 Bank:14 Status:0x8400aa4800a7413c
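The Status fields in these DEBUG lines are raw machine-check status register values (MCi_STATUS, called MCA_STATUS on AMD). A minimal sketch for reading them follows. It assumes the generic x86 MCI_STATUS_* bit positions used by the Linux kernel (bits 63..57) and takes the MCA error code from bits 15..0. It ignores AMD SMCA vendor-specific fields. The names ARCH_STATUS_BITS, decode_status and DEBUG_LINES are hypothetical. The decode is purely mechanical and says nothing about whether these errors are genuine or spurious.

```python
# Hypothetical helper: mechanically decode an x86 MCi_STATUS value.
# Bit positions follow the generic MCI_STATUS_* flags (bits 63..57);
# AMD SMCA vendor-specific fields are deliberately not decoded here.
ARCH_STATUS_BITS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # overflow: an earlier error was lost
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled for this error
    59: "MISCV",  # MCi_MISC contains valid data
    58: "ADDRV",  # MCi_ADDR contains valid data
    57: "PCC",    # processor context corrupt
}


def decode_status(status: int) -> dict:
    """Return the set architectural flags and the low 16-bit MCA error code."""
    flags = [name for bit, name in ARCH_STATUS_BITS.items() if (status >> bit) & 1]
    return {"flags": flags, "mca_error_code": status & 0xFFFF}


# (bank, status) pairs copied from the DEBUG lines in the report
DEBUG_LINES = [
    (11, 0x8700AA0800000000),
    (14, 0x8724AA0800000000),
    (11, 0x8424AA4800A9413B),
    (14, 0x8700AA0800000000),
    (11, 0x8700AA0800000000),
    (14, 0x8724AB8800000000),
    (11, 0x8700A28800000000),
    (14, 0x8400AA4800A7413C),
]

if __name__ == "__main__":
    for bank, status in DEBUG_LINES:
        d = decode_status(status)
        print(f"Bank {bank:2d}: {', '.join(d['flags'])} code=0x{d['mca_error_code']:04x}")
```

Under these assumptions, all eight values decode with VAL and ADDRV set and EN clear. The sketch does not interpret what that means for this bug.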


> Also, we haven't been able to reproduce this issue yet. So thank you for
> your help. It's much appreciated.
> 
> Thanks,
> Yazen
> 

It could still be a hardware error; I'm also going to run memtest86+.

Bert Karwatzki
