Message-ID: <9488e4bf935aa1e50179019419dfee93d306ded9.camel@web.de>
Date: Tue, 16 Sep 2025 22:27:35 +0200
From: Bert Karwatzki <spasswolf@....de>
To: Yazen Ghannam <yazen.ghannam@....com>, Borislav Petkov <bp@...en8.de>
Cc: Tony Luck <tony.luck@...el.com>, linux-kernel@...r.kernel.org,
linux-next@...r.kernel.org, linux-edac@...r.kernel.org,
linux-acpi@...r.kernel.org, x86@...nel.org, rafael@...nel.org,
qiuxu.zhuo@...el.com, nik.borisov@...e.com,
Smita.KoralahalliChannabasappa@....com, spasswolf@....de
Subject: Re: spurious mce Hardware Error messages in next-20250912
On Tuesday, 16.09.2025 at 10:07 -0400, Yazen Ghannam wrote:
> On Tue, Sep 16, 2025 at 11:10:55AM +0200, Borislav Petkov wrote:
> > On Mon, Sep 15, 2025 at 11:43:26PM +0200, Bert Karwatzki wrote:
> > > After re-cloning linux-next I tested next-20250911 and I get no mce error messages
> > > even if I set the check_interval to 10.
> >
> > Yazen, I've zapped everything from the handler unification onwards:
> >
> > 28e82d6f03b0 x86/mce: Save and use APEI corrected threshold limit
> > c8f4cea38959 x86/mce: Handle AMD threshold interrupt storms
> > 5a92e88ffc49 x86/mce/amd: Define threshold restart function for banks
> > 922300abd79d x86/mce/amd: Remove redundant reset_block()
> > 9b92e18973ce x86/mce/amd: Support SMCA corrected error interrupt
> > fe02d3d00b06 x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
> > cf6f155e848b x86/mce: Unify AMD DFR handler with MCA Polling
> > 53b3be0e79ef x86/mce: Unify AMD THR handler with MCA Polling
> >
> > until this is properly sorted out, now this close to the merge window.
> >
> > Thanks, Bert, for reporting!
> >
>
> No problem, thanks Boris.
>
> Bert, can you please try the following patch on next-20250912?
>
> I expect that you will see the "debug" message, but the regular MCA
> logging should be gone.
>
I applied your patch on next-20250912. These are now the only messages
I get from mce:
[ 333.337544] [ C0] mce: DEBUG: CPU0 Bank:11 Status:0x8700aa0800000000
[ 333.337556] [ C0] mce: DEBUG: CPU0 Bank:14 Status:0x8724aa0800000000
[ 661.017608] [ C0] mce: DEBUG: CPU0 Bank:11 Status:0x8424aa4800a9413b
[ 661.017619] [ C0] mce: DEBUG: CPU0 Bank:14 Status:0x8700aa0800000000
[ 988.697243] [ C0] mce: DEBUG: CPU0 Bank:11 Status:0x8700aa0800000000
[ 988.697250] [ C0] mce: DEBUG: CPU0 Bank:14 Status:0x8724ab8800000000
[ 1316.377571] [ C0] mce: DEBUG: CPU0 Bank:11 Status:0x8700a28800000000
[ 1316.377582] [ C0] mce: DEBUG: CPU0 Bank:14 Status:0x8400aa4800a7413c
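For anyone who wants to look at these status values bit by bit, here is a minimal sketch that parses the debug lines above and lists which architectural MCi_STATUS flags (bits 63..57: VAL, OVER, UC, EN, MISCV, ADDRV, PCC) are set, plus the low 16 bits (the MCA error code field). The line format is the one printed by the test patch, so the regular expression is an assumption tied to this output; the names `decode_status` and `parse_line` are made up for illustration. Vendor-specific fields are deliberately left undecoded.

```python
import re

# Architectural MCi_STATUS flag bits (bits 63..57).
# Vendor-specific bits are intentionally not decoded here.
STATUS_FLAGS = {
    63: "VAL",
    62: "OVER",
    61: "UC",
    60: "EN",
    59: "MISCV",
    58: "ADDRV",
    57: "PCC",
}

# Assumption: matches the debug line format shown in this mail only.
DEBUG_RE = re.compile(
    r"mce: DEBUG: CPU(\d+) Bank:(\d+) Status:(0x[0-9a-fA-F]+)"
)


def decode_status(status):
    """Return (set flag names from bit 63 downwards, low 16 bits)."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return flags, status & 0xFFFF


def parse_line(line):
    """Parse one debug line; return None if the line does not match."""
    m = DEBUG_RE.search(line)
    if m is None:
        return None
    cpu = int(m.group(1))
    bank = int(m.group(2))
    status = int(m.group(3), 16)
    flags, code = decode_status(status)
    return cpu, bank, status, flags, code
```

Feeding it the first line of the log gives CPU 0, bank 11, flags VAL, ADDRV and PCC, and an error code field of 0; the 661.017608 line gives VAL and ADDRV with an error code field of 0x413b. This only restates which bits are set and says nothing about whether the values are plausible.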
> Also, we haven't been able to reproduce this issue yet. So thank you for
> your help. It's much appreciated.
>
> Thanks,
> Yazen
>
It could still be a hardware error; I'm also going to run memtest86+.
Bert Karwatzki