Message-ID: <662102c9-94da-3193-08c4-9fe75411cadb@molgen.mpg.de>
Date: Mon, 9 Jan 2017 12:53:33 +0100
From: Paul Menzel <pmenzel@...gen.mpg.de>
To: Ashok Raj <ashok.raj@...el.com>, Borislav Petkov <bp@...en8.de>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Thorsten Leemhuis <linux@...mhuis.info>,
Len Brown <len.brown@...el.com>,
Tony Luck <tony.luck@...el.com>
Subject: Re: Dell XPS13: MCE (Hardware Error) reported
Dear Ashok, dear Borislav,
On 01/05/17 02:12, Raj, Ashok wrote:
>>> CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
>>> Hardware event. This is not a software error.
>>> MCE 1
>>> CPU 0 BANK 7
>>> MISC 7880018086 ADDR fef1ce40
>>> TIME 1483543069 Wed Jan 4 16:17:49 2017
>>> MCG status:
>>> MCi status:
>>> Error overflow
>>> Uncorrected error
>>> MCi_MISC register valid
>>> MCi_ADDR register valid
>>> Processor context corrupt
>>> MCA: corrected filtering (some unreported errors in same region)
>>> Generic CACHE Level-2 Generic Error
>>> STATUS ee0000000040110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
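For anyone following along, the bit decoding quoted above can be reproduced with a short sketch. It assumes the IA32_MCi_STATUS layout as I understand it from the Intel SDM (VAL at bit 63 down to PCC at bit 57, MSCOD in bits 31:16, MCACOD in bits 15:0). The function name `decode_mci_status` is mine and is not taken from mcelog or the kernel.

```python
# Assumed IA32_MCi_STATUS flag positions (per the Intel SDM layout).
STATUS_FLAGS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a raw MCi_STATUS value into its flag bits and error codes."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF  # model-specific error code
    fields["MCACOD"] = status & 0xFFFF         # architectural MCA error code
    return fields


if __name__ == "__main__":
    # STATUS value from the mcelog output quoted in this thread.
    for name, value in decode_mci_status(0xEE0000000040110A).items():
        print(f"{name:6} = {value:#x}")
```

Run on the STATUS value from the report, this gives EN=0 with VAL, OVER, UC and PCC all set, MCACOD 110a and MSCOD 0040, which matches the decoding above.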
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that.
To be clear, other than the message, the system is stable for me.
Here is `/proc/interrupts`.
```
$ more /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 27 0 0 0 IR-IO-APIC 2-edge timer
1: 3 2 125 5 IR-IO-APIC 1-edge i8042
8: 0 1 0 0 IR-IO-APIC 8-edge rtc0
9: 108 31 397 5 IR-IO-APIC 9-fasteoi acpi
12: 66 18 92 35 IR-IO-APIC 12-edge i8042
14: 0 0 0 0 IR-IO-APIC 14-fasteoi INT344B:00
16: 0 0 0 0 IR-IO-APIC 16-fasteoi idma64.0, i801_smbus, i2c_designware.0
17: 419 42 280 415 IR-IO-APIC 17-fasteoi idma64.1, i2c_designware.1
51: 2 0 0 1 IR-IO-APIC 51-fasteoi DLL075B:01
120: 0 0 0 0 DMAR-MSI 0-edge dmar0
121: 0 0 0 0 DMAR-MSI 1-edge dmar1
274: 17 2 0 4 IR-PCI-MSI 30932992-edge rtsx_pci
275: 89 26 57 45 IR-PCI-MSI 327680-edge xhci_hcd
276: 1886 0 2361 0 IR-PCI-MSI 31457280-edge nvme0q0, nvme0q1
277: 0 3010 2570 0 IR-PCI-MSI 31457281-edge nvme0q2
278: 0 0 2023 3480 IR-PCI-MSI 31457282-edge nvme0q3
279: 0 3319 0 5863 IR-PCI-MSI 31457283-edge nvme0q4
280: 45 0 0 0 IR-PCI-MSI 360448-edge mei_me
281: 201 52 3008 85 IR-PCI-MSI 32768-edge i915
282: 151 29 997 24821 IR-PCI-MSI 30408704-edge ath10k_pci
283: 331 938 677 188 IR-PCI-MSI 514048-edge snd_hda_intel:card0
NMI: 1 0 0 0 Non-maskable interrupts
LOC: 15198 21708 16850 31954 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 1 0 0 0 Performance monitoring interrupts
IWI: 3 0 0 0 IRQ work interrupts
RTR: 0 0 0 0 APIC ICR read retries
RES: 1329 1974 1532 1959 Rescheduling interrupts
CAL: 2254 3827 1969 3963 Function call interrupts
TLB: 396 2349 342 2193 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 9 9 9 9 Machine check polls
ERR: 17
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
PIW: 0 0 0 0 Posted-interrupt wakeup event
```
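The rows that matter for the question are the machine-check ones. A small sketch for pulling them out of the dump follows. The helper name `per_cpu_counts` is hypothetical, and it assumes each row is on one line, so mailer-wrapped rows need to be rejoined first. As far as I know, THR is where the kernel counts threshold (CMCI) interrupts on Intel, but please treat that as my assumption.

```python
def per_cpu_counts(text: str, labels: set) -> dict:
    """Return {label: [count per CPU]} for the requested /proc/interrupts rows."""
    counts = {}
    for line in text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if not sep or label not in labels:
            continue
        values = []
        for token in rest.split():
            if not token.isdigit():
                break  # first non-numeric token starts the description
            values.append(int(token))
        counts[label] = values
    return counts


# Rows copied from the dump in this message.
SAMPLE = """\
THR: 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 9 9 9 9 Machine check polls
"""

if __name__ == "__main__":
    for label, values in per_cpu_counts(SAMPLE, {"THR", "MCE", "MCP"}).items():
        print(label, values)
```

On a live system the same function could be fed the contents of `/proc/interrupts` directly. In the dump above, MCE and THR are zero on all four CPUs, while MCP shows 9 polls per CPU.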
> Although its reported as a L2 error, some memory errors can also manifest
> itself as a cache error in certain cases. In this case it looks like
> some speculative fetch from bad memory might be the cause.
>
>>> MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
> Threshold based error reporting (bit 11) (TES_P).
>
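The MCG_CAP value can be checked the same way. This is a minimal sketch assuming the IA32_MCG_CAP layout as I read it in the Intel SDM (bank count in bits 7:0, MCG_CTL_P at bit 8, MCG_EXT_P at bit 9, MCG_CMCI_P at bit 10, MCG_TES_P at bit 11). The function name `decode_mcg_cap` is only illustrative.

```python
def decode_mcg_cap(cap: int) -> dict:
    """Decode the low capability bits of a raw IA32_MCG_CAP value."""
    return {
        "bank_count": cap & 0xFF,        # number of reporting banks
        "MCG_CTL_P": (cap >> 8) & 1,     # IA32_MCG_CTL present
        "MCG_EXT_P": (cap >> 9) & 1,     # extended state registers present
        "MCG_CMCI_P": (cap >> 10) & 1,   # corrected MC interrupt supported
        "MCG_TES_P": (cap >> 11) & 1,    # threshold-based error status supported
    }


if __name__ == "__main__":
    # MCGCAP value from the mcelog output quoted in this thread.
    for name, value in decode_mcg_cap(0xC08).items():
        print(f"{name:10} = {value}")
```

For c08 this reports 8 banks with CMCI_P and TES_P set, which agrees with the summary above. The reported BANK 7 is then the last of banks 0 to 7.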
>
> Do you have another machine which doesn't report these errors? if so try
> swapping memory between them to see if the error disappears.
No, I don’t. And everybody I talked to with a Dell XPS13 (9360) seems to
have these errors.
> I don't have the model specific error handy.. will check that in the meantime
> to get some decoding as well.
>
> If you haven't already running some memory tests would also help.
I need some time for that.
> If you replaced the motherboard, did that involve both cpu and memory?
> or just the motheboard swap?
Sorry, I don’t know, as I am not the person from StackExchange [1].
Kind regards,
Paul
[1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283