Message-ID: <45d4081d93bbd50e1a23a112e3caca86ce979217.camel@web.de>
Date: Mon, 15 Sep 2025 23:03:45 +0200
From: Bert Karwatzki <spasswolf@....de>
To: Yazen Ghannam <yazen.ghannam@....com>
Cc: Borislav Petkov <bp@...en8.de>, Tony Luck <tony.luck@...el.com>,
linux-kernel@...r.kernel.org, linux-next@...r.kernel.org,
linux-edac@...r.kernel.org, linux-acpi@...r.kernel.org, x86@...nel.org,
rafael@...nel.org, qiuxu.zhuo@...el.com, nik.borisov@...e.com,
Smita.KoralahalliChannabasappa@....com, spasswolf@....de
Subject: Re: spurious mce Hardware Error messages in next-20250912
On Monday, 15.09.2025 at 13:55 -0400, Yazen Ghannam wrote:
>
>
> You can change this interval by writing to this file:
> /sys/devices/system/machinecheck/machinecheck0/check_interval
>
> Do the messages follow that setting? IOW, if you set it to '10', do you
> see error messages every 10 seconds?
Yes, if I set this to 10 I see these messages every 10 seconds.
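
For anyone who wants to repeat this check, a minimal sketch of changing the polling interval is shown below. The function name set_check_interval and its optional directory argument are illustrative assumptions, not something from this thread; only the sysfs path comes from the quoted message. Writing to the real file requires root.

```shell
# Hypothetical helper: write a polling interval (in seconds) to check_interval.
# $1 = interval, $2 = machinecheck directory (defaults to the path quoted above;
# the override exists only so the helper can be tried on a scratch directory).
set_check_interval() {
    interval="$1"
    dir="${2:-/sys/devices/system/machinecheck/machinecheck0}"

    # Accept only plain non-negative integers.
    case "$interval" in
        ''|*[!0-9]*)
            echo "interval must be a non-negative integer" >&2
            return 1
            ;;
    esac

    # Write the value; fails if the file is not writable (e.g. not root).
    printf '%s\n' "$interval" > "$dir/check_interval"
}
```

As root, `set_check_interval 10` should correspond to the setting tested above. Watching `dmesg -w` afterwards shows whether the messages follow the new interval.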
> >
> > As these messages do not appear in v6.17-rc5 I bisected the issue
> > (from v6.17-rc5 to next-20250912) and found this as the first bad commit:
> >
> > cf6f155e848b ("x86/mce: Unify AMD DFR handler with MCA Polling")
>
> Could you try another recent linux-next build without the MCA updates?
>
> It looks like 'next-20250911' doesn't include the commit above.
> >
For some reason I cannot find next-20250911 in my linux-next git tree:
$ git checkout next-202509(TAB TAB)
next-20250901 next-20250902 next-20250905 next-20250908 next-20250912
I'm currently re-cloning linux-next.
> > Are these error messages a new error that was not reported previously or
> > are these error messages a sign that the new code erroneously reports errors?
> >
>
> It could be that the recent code updates broke something. Or there may
> be other kernel changes causing new, spurious errors.
>
> We could also be picking up errors from the hardware that were
> previously ignored. I'll ask our hardware folks if this is a case we
> should address.
Perhaps these are errors which were not reported previously. When I check the
L3 cache error counts I get the following (these error_counts seem to persist
across reboots and do not increase when I get an mce error message):
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_0/l3_cache_0/error_count
0
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_1/l3_cache_1/error_count
0
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_2/l3_cache_2/error_count
9
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_3/l3_cache_3/error_count
0
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_4/l3_cache_4/error_count
72
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_5/l3_cache_5/error_count
0
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_6/l3_cache_6/error_count
3165
root@...a:~# cat /sys/devices/system/machinecheck/machinecheck0/l3_cache_7/l3_cache_7/error_count
72
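
The same counters can be collected in one pass instead of eight separate cat calls. This is only a sketch: the function name print_error_counts and its optional directory argument are assumptions made for illustration, and it simply walks whatever error_count files exist below the given directory.

```shell
# Hypothetical helper: print every error_count file below a directory as
# "relative/path: value". $1 defaults to the machinecheck0 directory used above.
print_error_counts() {
    dir="${1:-/sys/devices/system/machinecheck/machinecheck0}"

    # Find all error_count attributes, sort for stable output, print path and value.
    find "$dir" -type f -name error_count | sort | while IFS= read -r f; do
        printf '%s: %s\n' "${f#"$dir"/}" "$(cat "$f")"
    done
}
```

Running `print_error_counts` before and after an mce message appears makes it easy to diff the two outputs and confirm whether any counter changed.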
Bert Karwatzki