linux-kernel - Re: spurious mce Hardware Error messages in next-20250912

Message-ID: <20250917211509.GB1610597@yaz-khff2.amd.com>
Date: Wed, 17 Sep 2025 17:15:09 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Bert Karwatzki <spasswolf@....de>
Cc: Borislav Petkov <bp@...en8.de>, Tony Luck <tony.luck@...el.com>,
	linux-kernel@...r.kernel.org, linux-next@...r.kernel.org,
	linux-edac@...r.kernel.org, linux-acpi@...r.kernel.org,
	x86@...nel.org, rafael@...nel.org, qiuxu.zhuo@...el.com,
	nik.borisov@...e.com, Smita.KoralahalliChannabasappa@....com
Subject: Re: spurious mce Hardware Error messages in next-20250912

On Wed, Sep 17, 2025 at 03:26:52PM -0400, Yazen Ghannam wrote:
> On Wed, Sep 17, 2025 at 05:33:29PM +0200, Bert Karwatzki wrote:
> > > On Wednesday, 17.09.2025 at 10:41 -0400, Yazen Ghannam wrote:
> > > On Wed, Sep 17, 2025 at 09:13:11AM +0200, Bert Karwatzki wrote:
> > > > > On Tuesday, 16.09.2025 at 22:27 +0200, Bert Karwatzki wrote:
> > > [...]
> > > > 
> > > > > I ran a test for 10h and got one real deferred error. I also looked through
> > > > > older logs (which only go back to 2025-08-17), and they do not contain any
> > > > > mce Hardware Error messages. Here's the output of
> > > > 
> > > > $ dmesg | grep -E "mce|Hardware Error"
> > > > [...]
> > > > [10163.739261] [   T9326] mce: [Hardware Error]: Machine check events logged
> > > > [10163.739265] [   T9326] [Hardware Error]: Deferred error, no action required.
> > > > [10163.739267] [   T9326] [Hardware Error]: CPU:0 (19:50:0) MC14_STATUS[-|-|-|AddrV|PCC|-|-|Deferred|-|-]: 0x8700900800000000
> > > > [10163.739275] [   T9326] [Hardware Error]: Error Addr: 0x0095464100000020
> > > > [10163.739276] [   T9326] [Hardware Error]: IPID: 0x000700b040000000
> > > > [10163.739278] [   T9326] [Hardware Error]: L3 Cache Ext. Error Code: 0
> > > > [10163.739279] [   T9326] [Hardware Error]: cache level: RESV, tx: INSN
> > > > [...]
> > 
> > > This seems to be a real deferred error.
> 
> > The "Deferred" status bit is set, but that seems to be a coincidence. This
> error code shouldn't have this bit set. Likewise, in previous examples
> we saw the "Poison" status bit set when it shouldn't be.
> 
> > 
> > > 
> > > Summary so far:
> > > 1) Errors are found on CPU0 banks 11 and 14.
> > > 2) Errors are found during MCA timer-based polling.
> > > 3) The data is coming from MCA_DESTAT register.
> > > 4) The status bits are not consistent with documentation.
> > > 5) Likely these errors are not generating a deferred error interrupt.
> > > 
> > > > Bert, can you please collect the following data?
> > > 
> > > 1) Output of "/proc/interrupts".
> > >   a) The MCE, MCP, THR, and DFR lines are of interest.
> > >   b) We should verify if any other notification types occur besides
> > >      "MCP" (MCA polling).
> > 
> > This is from next-20250916 (without the debug patch), unfortunately I've
> > already rebooted after the testrun with next-20250912 and your debug patch.
> > 
> > $ cat /proc/interrupts | grep -E "DFR|THR|MCE|MCP"
> > >  THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
> > >  DFR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Deferred Error APIC interrupts
> > >  MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
> > >  MCP:         39         39         39         39         39         39         39         39         39         39         39         39         39         39         39         39   Machine check polls
> > 
> > 
> > 
> > > 2) Using an older kernel, read the MCA_DESTAT registers for L3 cache.
> > >   a) CPU0 bank 11: "sudo rdmsr -p 0 0xC00020b8"
> > >   b) CPU0 bank 14: "sudo rdmsr -p 0 0xC00020e8"
> > >   c) If these are non-zero, then I think we can confirm that the
> > >      spurious data was always there.
> > > 
> > > Thanks,
> > > Yazen
> > 
> > This is from 6.12.43+deb13-amd64 (the stock debian trixie kernel, currently the
> > oldest version I have installed):
> > 
> > # rdmsr -p 0 0xC00020b8
> > 8700aa0800000000
> > # rdmsr -p 0 0xC00020e8
> > 8700a28800000000
> > 
> 
> Right, so it seems we have bogus data logged in these registers. And
> this is unrelated to the recent patches.
> 
> We have some combination of bits set in MCA_DESTAT registers. The
> deferred error interrupt hasn't fired (at least from the latest
> example).
> 
> > There does seem to be some combination of bits that are always set, while
> > others flip between examples.
> 
> I'll highlight this to our hardware folks. But I don't think there's
> much we can do other than filter these out somehow.
> 
> I can add two checks to the patch to make it more like the current
> behavior.
> 
> 1) Check for 'Deferred' status bit when logging from the MCA_DESTAT.
> This was in the debug patch I shared.
> 2) Only check MCA_DESTAT when we are notified by the deferred error
> interrupt.
> 
> > Technically, neither of these should be necessary based on the
> > architecture.
> 
> So there's a third option: add this error signature to our filter_mce()
> function.
> 
> As I write this out, I feel more inclined to option #3. I think it would
> be better to stick to the architecture. We may get error reports like
> this. But that may be preferable so that any potential hardware issues
> can be investigated sooner. If there's a real problem, better to get it
> fixed in future products rather than implicitly mask it by our code
> flow.
> 
> Any thoughts from others?
> 

Bert, can you please run the following script to print all MCA
registers? We'd like to see if there are any other unusual values.

Also, can you please share the complete dmesg output from any boot?

Thanks,
Yazen


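#!/bin/bash
# Dump all 16 MCA registers of banks 0-31 as seen from CPU 0.
# Needs root, rdmsr (msr-tools) and the msr kernel module.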

regnames=(
		"CTL"
		"STATUS"
		"ADDR"
		"MISC0"
		"CONFIG"
		"IPID"
		"SYND"
		"RESV"
		"DESTAT"
		"DEADDR"
		"MISC1"
		"MISC2"
		"MISC3"
		"MISC4"
		"SYND1"
		"SYND2"
	 )

for bank in $(seq 0 31)
do
	echo Bank ${bank}
	for reg in $(seq 0 15)
	do
		echo -n "${regnames[$reg]}:	"
		rdmsr -p 0 -c0x $(printf 0x%x $((0xC0002000 + 0x10 * bank + reg)))
	done
	echo
done
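
As a sanity check on the address arithmetic in the script above, the same
formula can be evaluated without touching any MSR. A minimal bash sketch;
mca_msr is a made-up helper name, and the index 8 is simply the position
of "DESTAT" in the regnames array:

```shell
#!/bin/bash
# mca_msr BANK REG: MSR address as computed by the script above,
# 0xC0002000 + 0x10 * bank + register index (hypothetical helper name).
mca_msr() {
	printf '0x%X\n' $(( 0xC0002000 + 0x10 * $1 + $2 ))
}

DESTAT=8	# index of "DESTAT" in the regnames array

mca_msr 11 "$DESTAT"	# bank 11 MCA_DESTAT
mca_msr 14 "$DESTAT"	# bank 14 MCA_DESTAT
```

This prints 0xC00020B8 and 0xC00020E8, the same two addresses that were
read by hand with rdmsr earlier in the thread.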