linux-kernel - Re: Dell XPS13: MCE (Hardware Error) reported

Message-ID: <662102c9-94da-3193-08c4-9fe75411cadb@molgen.mpg.de>
Date:   Mon, 9 Jan 2017 12:53:33 +0100
From:   Paul Menzel <pmenzel@...gen.mpg.de>
To:     Ashok Raj <ashok.raj@...el.com>, Borislav Petkov <bp@...en8.de>
Cc:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Thorsten Leemhuis <linux@...mhuis.info>,
        Len Brown <len.brown@...el.com>,
        Tony Luck <tony.luck@...el.com>
Subject: Re: Dell XPS13: MCE (Hardware Error) reported

Dear Ashok, dear Borislav,


On 01/05/17 02:12, Raj, Ashok wrote:

>>> CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
>>> Hardware event. This is not a software error.
>>> MCE 1
>>> CPU 0 BANK 7
>>> MISC 7880018086 ADDR fef1ce40
>>> TIME 1483543069 Wed Jan  4 16:17:49 2017
>>> MCG status:
>>> MCi status:
>>> Error overflow
>>> Uncorrected error
>>> MCi_MISC register valid
>>> MCi_ADDR register valid
>>> Processor context corrupt
>>> MCA: corrected filtering (some unreported errors in same region)
>>> Generic CACHE Level-2 Generic Error
>>> STATUS ee0000000040110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't an MCE, hence it should
> have been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that.
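
The bit decoding quoted above can be reproduced with a few shifts and masks. The following is only a minimal sketch, assuming the architectural IA32_MCi_STATUS layout and the compound cache-hierarchy error code form `000F 0001 RRRR TTLL` as I understand them from the Intel SDM; the function names are made up for illustration and are not part of mcelog or the kernel.

```python
# Flag bit positions assumed from the architectural IA32_MCi_STATUS layout.
STATUS_FLAGS = {
    "VAL": 63,    # register holds valid error information
    "OVER": 62,   # error overflow
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}

REQUESTS = ("ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP")
TYPES = ("I", "D", "G", "reserved")  # instruction, data, generic
LEVELS = ("L0", "L1", "L2", "LG")    # LG = generic level


def decode_mci_status(status):
    """Split a raw MCi_STATUS value into its flag bits and error codes."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF  # model-specific error code
    fields["MCACOD"] = status & 0xFFFF         # architectural error code
    return fields


def decode_cache_mcacod(mcacod):
    """Decode a cache-hierarchy compound code; return None for other forms."""
    if (mcacod & 0xEF00) != 0x0100:  # bits 15:13 must be 0, bits 11:8 must be 0001
        return None
    request = (mcacod >> 4) & 0xF
    return {
        "filtering": bool((mcacod >> 12) & 1),  # F bit: corrected filtering
        "request": REQUESTS[request] if request < len(REQUESTS) else "reserved",
        "type": TYPES[(mcacod >> 2) & 0x3],
        "level": LEVELS[mcacod & 0x3],
    }


if __name__ == "__main__":
    fields = decode_mci_status(0xEE0000000040110A)  # STATUS from the log above
    print(fields)
    print(decode_cache_mcacod(fields["MCACOD"]))
```

For the STATUS value from the log, this yields EN=0 with VAL, OVER, UC, MISCV, ADDRV and PCC set, MSCOD 0x0040 and MCACOD 0x110a, and the cache code decodes to a generic error, generic type, level 2, with the filtering bit set. That agrees with the mcelog output quoted above.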

To be clear, other than the message, the system is stable for me.

Here is `/proc/interrupts`.

```
$ more /proc/interrupts
             CPU0       CPU1       CPU2       CPU3
    0:         27          0          0          0  IR-IO-APIC    2-edge      timer
    1:          3          2        125          5  IR-IO-APIC    1-edge      i8042
    8:          0          1          0          0  IR-IO-APIC    8-edge      rtc0
    9:        108         31        397          5  IR-IO-APIC    9-fasteoi   acpi
   12:         66         18         92         35  IR-IO-APIC   12-edge      i8042
   14:          0          0          0          0  IR-IO-APIC   14-fasteoi   INT344B:00
   16:          0          0          0          0  IR-IO-APIC   16-fasteoi   idma64.0, i801_smbus, i2c_designware.0
   17:        419         42        280        415  IR-IO-APIC   17-fasteoi   idma64.1, i2c_designware.1
   51:          2          0          0          1  IR-IO-APIC   51-fasteoi   DLL075B:01
  120:          0          0          0          0  DMAR-MSI    0-edge      dmar0
  121:          0          0          0          0  DMAR-MSI    1-edge      dmar1
  274:         17          2          0          4  IR-PCI-MSI 30932992-edge      rtsx_pci
  275:         89         26         57         45  IR-PCI-MSI 327680-edge      xhci_hcd
  276:       1886          0       2361          0  IR-PCI-MSI 31457280-edge      nvme0q0, nvme0q1
  277:          0       3010       2570          0  IR-PCI-MSI 31457281-edge      nvme0q2
  278:          0          0       2023       3480  IR-PCI-MSI 31457282-edge      nvme0q3
  279:          0       3319          0       5863  IR-PCI-MSI 31457283-edge      nvme0q4
  280:         45          0          0          0  IR-PCI-MSI 360448-edge      mei_me
  281:        201         52       3008         85  IR-PCI-MSI 32768-edge      i915
  282:        151         29        997      24821  IR-PCI-MSI 30408704-edge      ath10k_pci
  283:        331        938        677        188  IR-PCI-MSI 514048-edge      snd_hda_intel:card0
  NMI:          1          0          0          0   Non-maskable interrupts
  LOC:      15198      21708      16850      31954   Local timer interrupts
  SPU:          0          0          0          0   Spurious interrupts
  PMI:          1          0          0          0   Performance monitoring interrupts
  IWI:          3          0          0          0   IRQ work interrupts
  RTR:          0          0          0          0   APIC ICR read retries
  RES:       1329       1974       1532       1959   Rescheduling interrupts
  CAL:       2254       3827       1969       3963   Function call interrupts
  TLB:        396       2349        342       2193   TLB shootdowns
  TRM:          0          0          0          0   Thermal event interrupts
  THR:          0          0          0          0   Threshold APIC interrupts
  DFR:          0          0          0          0   Deferred Error APIC interrupts
  MCE:          0          0          0          0   Machine check exceptions
  MCP:          9          9          9          9   Machine check polls
  ERR:         17
  MIS:          0
  PIN:          0          0          0          0   Posted-interrupt notification event
  PIW:          0          0          0          0   Posted-interrupt wakeup event
```
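
For the question of how the event was signaled, the rows of interest in this output are `MCE`, `THR` and `MCP`. Below is a minimal sketch for pulling those per-CPU counters out of `/proc/interrupts` text. The helper name `per_cpu_counts` is hypothetical, the sample rows are copied from the output above, and my reading that CMCIs are accounted in the `THR` row on Intel systems is an assumption, not something confirmed in this thread.

```python
from typing import List

# Three rows copied from the /proc/interrupts output above.
SAMPLE = """\
  THR:          0          0          0          0   Threshold APIC interrupts
  MCE:          0          0          0          0   Machine check exceptions
  MCP:          9          9          9          9   Machine check polls
"""


def per_cpu_counts(text: str, label: str) -> List[int]:
    """Return the per-CPU counters of the row named `label`."""
    for line in text.splitlines():
        name, sep, rest = line.strip().partition(":")
        if not sep or name != label:
            continue
        counts = []
        for token in rest.split():
            if not token.isdigit():  # the description starts here
                break
            counts.append(int(token))
        return counts
    raise KeyError(label)


if __name__ == "__main__":
    # On a live system, replace SAMPLE with open("/proc/interrupts").read().
    for label in ("MCE", "THR", "MCP"):
        print(label, per_cpu_counts(SAMPLE, label))
```

With the rows above, `MCE` and `THR` are zero on all four CPUs and only `MCP` (machine check polls) is non-zero, nine on each CPU.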

> Although it's reported as an L2 error, some memory errors can also manifest
> themselves as cache errors in certain cases.  In this case it looks like
> some speculative fetch from bad memory might be the cause.
>
>>> MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
> Threshold based error reporting (bit 11) (TES_P).
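
The capability bits named above can be checked directly against the reported value. This is a minimal sketch, assuming the architectural IA32_MCG_CAP layout (bank count in bits 7:0, MCG_CTL_P in bit 8, MCG_EXT_P in bit 9, CMCI_P in bit 10, TES_P in bit 11); `decode_mcg_cap` is a hypothetical helper, not an existing tool.

```python
def decode_mcg_cap(cap):
    """Decode the low capability fields of a raw MCG_CAP value."""
    return {
        "bank_count": cap & 0xFF,            # number of reporting banks
        "MCG_CTL_P": bool((cap >> 8) & 1),   # IA32_MCG_CTL present
        "MCG_EXT_P": bool((cap >> 9) & 1),   # extended state registers present
        "CMCI_P": bool((cap >> 10) & 1),     # corrected machine check interrupt
        "TES_P": bool((cap >> 11) & 1),      # threshold-based error status
    }


if __name__ == "__main__":
    print(decode_mcg_cap(0xC08))  # MCGCAP value from the log
```

For `c08` this gives eight banks with CMCI_P and TES_P set and the two lower capability bits clear. Eight banks would be numbered 0 to 7, which fits the error being reported for BANK 7.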
>
>
> Do you have another machine which doesn't report these errors? If so, try
> swapping memory between them to see if the error disappears.

No, I don’t. Everybody I have talked to who owns a Dell XPS13 (9360)
seems to see these errors.

> I don't have the model-specific error handy. I will check that in the meantime
> to get some decoding as well.
>
> If you haven't already, running some memory tests would also help.

I need some time for that.

> If you replaced the motherboard, did that involve both CPU and memory,
> or just the motherboard swap?

Sorry, I don’t know, as I am not the person from StackExchange [1].


Kind regards,

Paul


[1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283