Message-ID: <20250915175531.GB869676@yaz-khff2.amd.com>
Date: Mon, 15 Sep 2025 13:55:31 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Bert Karwatzki <spasswolf@....de>
Cc: Borislav Petkov <bp@...en8.de>, Tony Luck <tony.luck@...el.com>,
linux-kernel@...r.kernel.org, linux-next@...r.kernel.org,
linux-edac@...r.kernel.org, linux-acpi@...r.kernel.org,
x86@...nel.org, rafael@...nel.org, qiuxu.zhuo@...el.com,
nik.borisov@...e.com, Smita.KoralahalliChannabasappa@....com
Subject: Re: spurious mce Hardware Error messages in next-20250912
On Mon, Sep 15, 2025 at 03:00:09AM +0200, Bert Karwatzki wrote:
> On my MSI Alpha 15 (amd64) laptop running Debian stable (trixie) and
> kernel next-20250912 I noticed the following mce error message in dmesg:
>
> [ T10] mce: [Hardware Error]: Machine check events logged
> [ T10] [Hardware Error]: Corrected error, no action required.
> [ T10] [Hardware Error]: CPU:0 (19:50:0) MC11_STATUS[-|CE|-|AddrV|-|-|-|UECC|-|Poison|-]: 0x8400aa4800a90139
> [ T10] [Hardware Error]: Error Addr: 0x006637a200000020
> [ T10] [Hardware Error]: IPID: 0x000700b040000000
> [ T10] [Hardware Error]: L3 Cache Ext. Error Code: 41
> [ T10] [Hardware Error]: cache level: L1, tx: GEN, mem-tx: DRD
> [ T10] mce: [Hardware Error]: Machine check events logged
> [ T10] [Hardware Error]: Corrected error, no action required.
> [ T10] [Hardware Error]: CPU:0 (19:50:0) MC14_STATUS[-|CE|-|AddrV|PCC|-|SyndV|UECC|-|Poison|-]: 0x8724ac0800000000
> [ T10] [Hardware Error]: Error Addr: 0x002bf52e00000020
> [ T10] [Hardware Error]: IPID: 0x000700b040000000, Syndrome: 0x0000000000000042
> [ T10]
> [ T10] [Hardware Error]: L3 Cache Ext. Error Code: 0
> [ T10] [Hardware Error]: cache level: RESV, tx: INSN
The error messages are very odd: the MCA_STATUS bits are inconsistent.
They are logged as "corrected" errors (CE), yet they have "uncorrected"
bits like "UECC", "PCC", and "Poison" set.
>
> The messages start about 333.34s after boot and usually appear 327.68s apart
> (Yes, these timings are reproducible!):
This is likely because the errors are found during MCA polling. The
default polling interval is 300 seconds. There may be some drift if
other tasks are scheduled at the same time.

You can change this interval by writing to this file:

  /sys/devices/system/machinecheck/machinecheck0/check_interval

Do the messages follow that setting? IOW, if you set it to '10', do you
see error messages every 10 seconds?
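
If it helps with the experiment, a small sketch for reading and changing the
interval from a script could look like this. The function names are
hypothetical, writing to the real sysfs file needs root, and I am assuming
the file reads back as a plain decimal number of seconds.

```python
from pathlib import Path

# sysfs file named above; pass a different path to try the helpers safely.
CHECK_INTERVAL = Path(
    "/sys/devices/system/machinecheck/machinecheck0/check_interval"
)


def get_interval(path=CHECK_INTERVAL):
    """Read the current polling interval (assumed decimal seconds)."""
    return int(Path(path).read_text().strip())


def set_interval(seconds, path=CHECK_INTERVAL):
    """Write a new polling interval in seconds; needs root on the real file."""
    if seconds < 0:
        raise ValueError("interval must not be negative")
    Path(path).write_text(f"{seconds}\n")
```

Calling set_interval(10) as root and then watching dmesg should show
whether the messages now arrive roughly every 10 seconds.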
>
> As these messages do not appear in v6.17-rc5 I bisected the issue
> (from v6.17-rc5 to next-20250912) and found this as the first bad commit:
>
> cf6f155e848b ("x86/mce: Unify AMD DFR handler with MCA Polling")
Could you try another recent linux-next build without the MCA updates?
It looks like 'next-20250911' doesn't include the commit above.
>
> Are these error messages a new error that was not reported previously or
> are these error messages a sign that the new code erroneously reports errors?
>
It could be that the recent code updates broke something. Or there may
be other kernel changes causing new, spurious errors.

We could also be picking up errors from the hardware that were
previously ignored. I'll ask our hardware folks if this is a case we
should address.

Thanks for reporting this!

-Yazen