linux-kernel - Re: spurious mce Hardware Error messages in next-20250912

Message-ID: <45d4081d93bbd50e1a23a112e3caca86ce979217.camel@web.de>
Date: Mon, 15 Sep 2025 23:03:45 +0200
From: Bert Karwatzki <spasswolf@....de>
To: Yazen Ghannam <yazen.ghannam@....com>
Cc: Borislav Petkov <bp@...en8.de>, Tony Luck <tony.luck@...el.com>, 
	linux-kernel@...r.kernel.org, linux-next@...r.kernel.org, 
	linux-edac@...r.kernel.org, linux-acpi@...r.kernel.org, x86@...nel.org, 
	rafael@...nel.org, qiuxu.zhuo@...el.com, nik.borisov@...e.com, 
	Smita.KoralahalliChannabasappa@....com, spasswolf@....de
Subject: Re: spurious mce Hardware Error messages in next-20250912

On Monday, 15.09.2025 at 13:55 -0400, Yazen Ghannam wrote:
> 
> 
> You can change this interval by writing to this file:
> /sys/devices/system/machinecheck/machinecheck0/check_interval
> 
> Do the messages follow that setting? IOW, if you set it to '10', do you
> see error messages every 10 seconds?

Yes, if I set this to 10, I see these messages every 10 seconds.
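
For anyone who wants to repeat this check, here is a minimal sketch of
setting the interval and reading it back. The helper name
set_check_interval and the MCE_DIR variable are hypothetical, introduced
only for this example; the sysfs path is the one quoted above, and writing
to it on a real system needs root.

```shell
#!/bin/sh
# Sketch: set and read back the MCA polling interval (in seconds).
# MCE_DIR defaults to the sysfs directory quoted above; the variable name and
# the helper name are hypothetical, introduced only for this example.
MCE_DIR="${MCE_DIR:-/sys/devices/system/machinecheck/machinecheck0}"

set_check_interval() {
    # $1: interval in seconds. Writing needs root on a real system.
    printf '%s\n' "$1" > "$MCE_DIR/check_interval" || return 1
    # Read the value back to confirm the write took effect.
    cat "$MCE_DIR/check_interval"
}
```

Calling `set_check_interval 10` as root and then watching the kernel log
should show whether the messages follow the new interval.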

> > 
> > As these messages do not appear in v6.17-rc5 I bisected the issue 
> > (from v6.17-rc5 to next-20250912) and found this as the first bad commit:
> > 
> > cf6f155e848b ("x86/mce: Unify AMD DFR handler with MCA Polling")
> 
> Could you try another recent linux-next build without the MCA updates?
> 
> It looks like 'next-20250911' doesn't include the commit above.
> > 

Somehow I cannot find next-20250911 in my linux-next git tree:

$ git checkout next-202509(TAB TAB)
next-20250901   next-20250902   next-20250905   next-20250908   next-20250912   

I'm currently re-cloning linux-next.


> > Are these error messages a new error that was not reported previously or
> > are these error messages a sign that the new code erroneously reports errors?
> > 
> 
> It could be that the recent code updates broke something. Or there may
> be other kernel changes causing new, spurious errors.
> 
> We could also be picking up errors from the hardware that were
> previously ignored. I'll ask our hardware folks if this is a case we
> should address.

Perhaps these are errors that were not reported previously. When I check the
L3 cache error counts I get the following (the counts seem to persist across
reboots and do not increase when I get an mce error message):

root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_0/l3_cache_0/error_count
0
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_1/l3_cache_1/error_count
0
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_2/l3_cache_2/error_count
9
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_3/l3_cache_3/error_count
0
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_4/l3_cache_4/error_count
72
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_5/l3_cache_5/error_count
0
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_6/l3_cache_6/error_count
3165
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_7/l3_cache_7/error_count
72
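
The eight reads above can also be done in one pass. This is a minimal
sketch that assumes the same directory layout as the paths shown; the
helper name dump_l3_error_counts is hypothetical.

```shell
#!/bin/sh
# Sketch: print every L3 cache error_count under one directory in a single pass.
# dump_l3_error_counts is a hypothetical helper name.
dump_l3_error_counts() {
    # $1: base directory; defaults to the path used in the commands above.
    base="${1:-/sys/devices/system/machinecheck/machinecheck0}"
    for f in "$base"/l3_cache_*/l3_cache_*/error_count; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -r "$f" ] || continue
        # Label each count with its cache directory name, e.g. l3_cache_6.
        name="$(basename "$(dirname "$f")")"
        printf '%s %s\n' "$name" "$(cat "$f")"
    done
}
```

Saving the output of `dump_l3_error_counts` before and after an mce
message appears and comparing the two would show whether any counter
moves.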

Bert Karwatzki