linux-kernel - Re: spurious mce Hardware Error messages in next-20250912


Message-ID: <6e1eda7dd55f6fa30405edf7b0f75695cf55b237.camel@web.de>
Date: Wed, 17 Sep 2025 17:33:29 +0200
From: Bert Karwatzki <spasswolf@....de>
To: Yazen Ghannam <yazen.ghannam@....com>
Cc: Borislav Petkov <bp@...en8.de>, Tony Luck <tony.luck@...el.com>, 
	linux-kernel@...r.kernel.org, linux-next@...r.kernel.org, 
	linux-edac@...r.kernel.org, linux-acpi@...r.kernel.org, x86@...nel.org, 
	rafael@...nel.org, qiuxu.zhuo@...el.com, nik.borisov@...e.com, 
	Smita.KoralahalliChannabasappa@....com, spasswolf@....de
Subject: Re: spurious mce Hardware Error messages in next-20250912

On Wednesday, 17.09.2025 at 10:41 -0400, Yazen Ghannam wrote:
> On Wed, Sep 17, 2025 at 09:13:11AM +0200, Bert Karwatzki wrote:
> > On Tuesday, 16.09.2025 at 22:27 +0200, Bert Karwatzki wrote:
> [...]
> > 
> > I ran a test for 10h and got one real deferred error. I also looked through
> > older logs (which only go back to 2025-08-17), and they do not contain any
> > mce Hardware errors. Here's the output of
> > 
> > $ dmesg | grep -E "mce|Hardware Error"
> > [...]
> > [10163.739261] [   T9326] mce: [Hardware Error]: Machine check events logged
> > [10163.739265] [   T9326] [Hardware Error]: Deferred error, no action required.
> > [10163.739267] [   T9326] [Hardware Error]: CPU:0 (19:50:0) MC14_STATUS[-|-|-|AddrV|PCC|-|-|Deferred|-|-]: 0x8700900800000000
> > [10163.739275] [   T9326] [Hardware Error]: Error Addr: 0x0095464100000020
> > [10163.739276] [   T9326] [Hardware Error]: IPID: 0x000700b040000000
> > [10163.739278] [   T9326] [Hardware Error]: L3 Cache Ext. Error Code: 0
> > [10163.739279] [   T9326] [Hardware Error]: cache level: RESV, tx: INSN
> > [...]

This seems to be a real deferred error.

> 
> Summary so far:
> 1) Errors are found on CPU0 banks 11 and 14.
> 2) Errors are found during MCA timer-based polling.
> 3) The data is coming from MCA_DESTAT register.
> 4) The status bits are not consistent with documentation.
> 5) Likely these errors are not generating a deferred error interrupt.
> 
> Bert, can you please collect the following data?
> 
> 1) Output of "/proc/interrupts".
>   a) The MCE, MCP, THR, and DFR lines are of interest.
>   b) We should verify if any other notification types occur besides
>      "MCP" (MCA polling).

This is from next-20250916 (without the debug patch). Unfortunately, I had
already rebooted after the test run with next-20250912 and your debug patch.

$ cat /proc/interrupts | grep -E "DFR|THR|MCE|MCP"
 THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 DFR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Deferred Error APIC interrupts
 MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
 MCP:         39         39         39         39         39         39         39         39         39         39         39         39         39         39         39         39   Machine check polls
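
To compare notification types across runs, the per-CPU columns can be summed
per label. The following is only a rough sketch: the function name
mca_interrupt_totals is made up for illustration, and it assumes each label
sits on a single unwrapped line, as the kernel prints it (a mail-wrapped copy
would lose the columns on the continuation line).

```python
def mca_interrupt_totals(text, labels=("MCE", "MCP", "THR", "DFR")):
    """Sum the per-CPU counters of selected /proc/interrupts lines.

    `text` is the content of /proc/interrupts; `labels` are the row
    names to keep (hypothetical helper, not part of any tool).
    """
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if not sep or label not in labels:
            continue  # header line, or a row we are not interested in
        total = 0
        for token in rest.split():
            if not token.isdigit():
                break  # first non-numeric token starts the description
            total += int(token)
        totals[label] = total
    return totals
```

It could be fed with open("/proc/interrupts").read(). For the 16-CPU output
above, the MCP total would be 16 * 39 = 624 and the other three totals 0.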



> 2) Using an older kernel, read the MCA_DESTAT registers for L3 cache.
>   a) CPU0 bank 11: "sudo rdmsr -p 0 0xC00020b8"
>   b) CPU0 bank 14: "sudo rdmsr -p 0 0xC00020e8"
>   c) If these are non-zero, then I think we can confirm that the
>      spurious data was always there.
> 
> Thanks,
> Yazen
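
The two MSR addresses in the request follow a regular pattern. As a small
sketch, assuming the SMCA layout as I understand it (bank 0 MCA_DESTAT at
0xC0002008 and 0x10 MSRs per bank; the names below are illustrative only),
the address for any bank can be computed like this:

```python
SMCA_MC0_DESTAT = 0xC0002008  # assumed MCA_DESTAT address of bank 0
SMCA_BANK_STRIDE = 0x10       # assumed MSR address distance between banks


def smca_destat_msr(bank):
    """Return the MCA_DESTAT MSR address for an SMCA bank number."""
    return SMCA_MC0_DESTAT + SMCA_BANK_STRIDE * bank


for bank in (11, 14):
    print(f"bank {bank}: rdmsr -p 0 {smca_destat_msr(bank):#x}")
```

Under these assumptions banks 11 and 14 give 0xc00020b8 and 0xc00020e8,
which matches the addresses quoted above.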

This is from 6.12.43+deb13-amd64 (the stock Debian trixie kernel, currently the
oldest version I have installed):

# rdmsr -p 0 0xC00020b8
8700aa0800000000
# rdmsr -p 0 0xC00020e8
8700a28800000000
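
Both values are non-zero. To make the raw values easier to compare with the
decoded MC14_STATUS line from dmesg, here is a minimal decoding sketch. The
bit positions are my assumption, taken from the flag definitions as I
understand them (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57,
TCC 55, SyndV 53, Deferred 44, Poison 43, Scrub 40); bits without a name in
this table are simply ignored, and decode_status is a made-up helper name.

```python
# Assumed bit positions of the named status flags (see the text above).
STATUS_BITS = [
    (63, "Val"), (62, "Over"), (61, "UC"), (60, "En"),
    (59, "MiscV"), (58, "AddrV"), (57, "PCC"),
    (55, "TCC"), (53, "SyndV"),
    (44, "Deferred"), (43, "Poison"), (40, "Scrub"),
]


def decode_status(value):
    """Return the names of the known flags set in a 64-bit status value."""
    return [name for bit, name in STATUS_BITS if (value >> bit) & 1]


for raw in ("8700900800000000", "8700aa0800000000", "8700a28800000000"):
    print(raw, "|".join(decode_status(int(raw, 16))))
```

With this table the dmesg value 0x8700900800000000 decodes to
Val|AddrV|PCC|Deferred, in line with the kernel's own output, while neither
of the two MCA_DESTAT readings has the Deferred bit (44) set.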


Bert Karwatzki