Message-ID: <6e1eda7dd55f6fa30405edf7b0f75695cf55b237.camel@web.de>
Date: Wed, 17 Sep 2025 17:33:29 +0200
From: Bert Karwatzki <spasswolf@....de>
To: Yazen Ghannam <yazen.ghannam@....com>
Cc: Borislav Petkov <bp@...en8.de>, Tony Luck <tony.luck@...el.com>,
linux-kernel@...r.kernel.org, linux-next@...r.kernel.org,
linux-edac@...r.kernel.org, linux-acpi@...r.kernel.org, x86@...nel.org,
rafael@...nel.org, qiuxu.zhuo@...el.com, nik.borisov@...e.com,
Smita.KoralahalliChannabasappa@....com, spasswolf@....de
Subject: Re: spurious mce Hardware Error messages in next-20250912
On Wednesday, 17 Sep 2025 at 10:41 -0400, Yazen Ghannam wrote:
> On Wed, Sep 17, 2025 at 09:13:11AM +0200, Bert Karwatzki wrote:
> > On Tuesday, 16 Sep 2025 at 22:27 +0200, Bert Karwatzki wrote:
> [...]
> >
> > I ran a test for 10h and got one real deferred error. I also looked through
> > older logs (which only go back to 2025-08-17), and they do not contain any
> > mce Hardware Error messages. Here's the output of
> >
> > $ dmesg | grep -E "mce|Hardware Error"
> > [...]
> > [10163.739261] [ T9326] mce: [Hardware Error]: Machine check events logged
> > [10163.739265] [ T9326] [Hardware Error]: Deferred error, no action required.
> > [10163.739267] [ T9326] [Hardware Error]: CPU:0 (19:50:0) MC14_STATUS[-|-|-|AddrV|PCC|-|-|Deferred|-|-]: 0x8700900800000000
> > [10163.739275] [ T9326] [Hardware Error]: Error Addr: 0x0095464100000020
> > [10163.739276] [ T9326] [Hardware Error]: IPID: 0x000700b040000000
> > [10163.739278] [ T9326] [Hardware Error]: L3 Cache Ext. Error Code: 0
> > [10163.739279] [ T9326] [Hardware Error]: cache level: RESV, tx: INSN
> > [...]
This seems to be a real deferred error.
>
> Summary so far:
> 1) Errors are found on CPU0 banks 11 and 14.
> 2) Errors are found during MCA timer-based polling.
> 3) The data is coming from MCA_DESTAT register.
> 4) The status bits are not consistent with documentation.
> 5) Likely these errors are not generating a deferred error interrupt.
>
> Bert, can you please collect the following data?
>
> 1) Output of "/proc/interrupts".
> a) The MCE, MCP, THR, and DFR lines are of interest.
> b) We should verify if any other notification types occur besides
> "MCP" (MCA polling).
This is from next-20250916 (without the debug patch); unfortunately, I had
already rebooted after the test run with next-20250912 and your debug patch.
$ cat /proc/interrupts | grep -E "DFR|THR|MCE|MCP"
THR: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 Machine check polls
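
To make the check in 1b) repeatable across test runs, the per-CPU counters can be summed per notification type. The following is only a minimal sketch: the function name interrupt_totals is hypothetical, and it assumes the usual /proc/interrupts layout (a label, a colon, one counter per CPU, then a description), with each row on a single line.

```python
def interrupt_totals(text, names=("MCE", "MCP", "THR", "DFR")):
    """Sum the per-CPU counters of the named rows in /proc/interrupts text.

    Hypothetical helper; assumes each row is 'LABEL: n n n ... description'.
    """
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or label not in names:
            continue  # header line or a row we are not interested in
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # the free-text description starts here
            total += int(field)
        totals[label] = total
    return totals


if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as f:
            print(interrupt_totals(f.read()))
    except OSError as exc:
        print(f"cannot read /proc/interrupts: {exc}")
```

If only the MCP total is non-zero, polling is the only notification type that has occurred since boot.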
> 2) Using an older kernel, read the MCA_DESTAT registers for L3 cache.
> a) CPU0 bank 11: "sudo rdmsr -p 0 0xC00020b8"
> b) CPU0 bank 14: "sudo rdmsr -p 0 0xC00020e8"
> c) If these are non-zero, then I think we can confirm that the
> spurious data was always there.
>
> Thanks,
> Yazen
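
The two MSR addresses above follow a regular per-bank pattern, so the register for any other bank can be derived the same way. This is a sketch under assumptions: the base 0xC0002008 and the stride 0x10 are taken from memory of the Linux kernel's SMCA MSR definitions and are not stated in this thread, though they do reproduce both addresses given for banks 11 and 14. The function name mca_destat_msr is hypothetical.

```python
# Assumed values (not from this thread): MCA_DESTAT of bank 0 and the
# per-bank MSR stride for AMD Scalable MCA.
SMCA_MC0_DESTAT = 0xC0002008
SMCA_BANK_STRIDE = 0x10


def mca_destat_msr(bank):
    """Return the assumed MCA_DESTAT MSR address for an SMCA bank number."""
    if bank < 0:
        raise ValueError("bank must be non-negative")
    return SMCA_MC0_DESTAT + SMCA_BANK_STRIDE * bank


if __name__ == "__main__":
    for bank in (11, 14):
        # Prints a command line in the same form as the ones quoted above.
        print(f"bank {bank}: sudo rdmsr -p 0 {mca_destat_msr(bank):#x}")
```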
This is from 6.12.43+deb13-amd64 (the stock Debian trixie kernel, currently the
oldest version I have installed):
# rdmsr -p 0 0xC00020b8
8700aa0800000000
# rdmsr -p 0 0xC00020e8
8700a28800000000
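
For comparing these raw values with the flags the kernel prints, a small decoder helps. This is a minimal sketch, not the kernel's decoder: the bit positions and names below are assumptions recalled from the MCI_STATUS_* definitions in the kernel's x86 mce.h, and bits without an assumed name are ignored.

```python
# Assumed bit positions and names (hypothetical table, highest bit first).
STATUS_BITS = [
    (63, "Val"), (62, "Over"), (61, "UC"), (60, "En"),
    (59, "MiscV"), (58, "AddrV"), (57, "PCC"), (55, "TCC"),
    (53, "SyndV"), (46, "CECC"), (45, "UECC"),
    (44, "Deferred"), (43, "Poison"), (40, "Scrub"),
]


def decode_status(value):
    """Return the assumed names of all bits set in an MCA status value."""
    return [name for bit, name in STATUS_BITS if (value >> bit) & 1]


if __name__ == "__main__":
    samples = {
        "logged MC14_STATUS": 0x8700900800000000,
        "rdmsr 0xC00020b8": 0x8700AA0800000000,
        "rdmsr 0xC00020e8": 0x8700A28800000000,
    }
    for label, value in samples.items():
        print(f"{label}: {value:#018x} -> {'|'.join(decode_status(value))}")
```

With these assumed names, the logged value decodes to Val, AddrV, PCC and Deferred, which agrees with the flags shown in the MC14_STATUS line above, while the two values read with rdmsr have UECC set instead of Deferred.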
Bert Karwatzki