linux-kernel - Re: [HW PROBLEM] Intel I7 MCE. Erratum or not?

Message-ID: <493B4242.1040202@shaw.ca>
Date:	Sat, 06 Dec 2008 21:25:54 -0600
From:	Robert Hancock <hancockr@...w.ca>
To:	Giangiacomo Mariotti <gg.mariotti@...il.com>
CC:	linux-kernel@...r.kernel.org
Subject: Re: [HW PROBLEM] Intel I7 MCE. Erratum or not?

Giangiacomo Mariotti wrote:
> On Sat, Dec 6, 2008 at 10:47 PM, Robert Hancock <hancockr@...w.ca> wrote:
>> Giangiacomo Mariotti wrote:
>>> On Sat, Dec 6, 2008 at 9:58 PM, Robert Hancock <hancockr@...w.ca> wrote:
>>>> Giangiacomo Mariotti wrote:
>>>>> Hi everyone,
>>>>> Mcelog just logged this on my new Intel I7 920 (on Linux 2.6.27.8):
>>>>> MCE 0
>>>>> HARDWARE ERROR. This is *NOT* a software problem!
>>>>> Please contact your hardware vendor
>>>>> CPU 0 BANK 6 MISC 202d ADDR ffeef740
>>>>> MCG status:
>>>>> MCi status:
>>>>> Error overflow
>>>>> Uncorrected error
>>>>> MCi_MISC register valid
>>>>> MCi_ADDR register valid
>>>>> Processor context corrupt
>>>>> MCA: Generic CACHE Level-2 Data-Write Error
>>>>> STATUS ee0000000100014a MCGSTATUS 0
>>>>>
>>>>> I'm reporting this here because I found in the Intel I7 Technical
>>>>> Specification November 2008 update that something which seems very
>>>>> similar is in fact an erratum. So my question is: is there any way
>>>>> for me to verify that my problem is due to one of those errata, instead
>>>>> of broken hardware (if we don't want to consider all those errata as
>>>>> broken hardware)? I'm also reporting this because I thought it may be
>>>>> useful to signal that (if actually due to those errata) these problems
>>>>> actually occur, so it may be useful to find workarounds in the kernel
>>>>> so as not to scare poor Linux users to death!
>>>> Which erratum are you talking about? I don't see one in that document
>>>> that would match this case.
>>>>
>>> Well, the first one seems very similar, even if it talks about a DTLB
>>> error instead of a cache error. But sure, being similar doesn't mean too
>>> much. Number 52 seems similar too. I guess I should just give up and
>>> admit that my hardware is broken!
>>>
>> The first one is just indicating that if a DTLB error occurs the overflow
>> bit may be set incorrectly. It's not a false error though. The AAJ52 erratum
>> would only occur immediately after powerup or wake from sleep states.
>>
> The MCE actually got logged once immediately after powerup and never
> again. Is that reasonable? A cache error which happens just once after
> boot?

The erratum refers to an internal parity error, not an L2 cache write error.

If it only happened once then who knows, it could be a cosmic ray or
something. But if it happens again, it sounds like you likely have a bad
CPU.
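
As an aside, the STATUS value in the log above (ee0000000100014a) can be
checked by hand against the text mcelog printed. The following is only a
rough Python sketch, assuming the architectural IA32_MCi_STATUS layout from
the Intel SDM (flag bits 63..57, model-specific code in bits 31:16, MCA
error code in bits 15:0) and the compound cache-hierarchy error format
000F 0001 RRRR TTLL. The function and table names are made up for
illustration and are not part of mcelog or the kernel.

```python
# Illustrative decoder for an IA32_MCi_STATUS value (hypothetical helper,
# not mcelog's code). Only the cache-hierarchy error format is decoded.

# Architectural flag bits in the upper part of MCi_STATUS.
FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # error overflow
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Sub-fields of the compound cache-hierarchy error code.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}


def decode_mci_status(status):
    """Return the set flags and, if applicable, the cache error fields."""
    flags = [name for bit, name in FLAGS if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_code": mca_code,
        "model_specific": (status >> 16) & 0xFFFF,
    }
    # Cache-hierarchy errors look like 000F 0001 RRRR TTLL; bit 12 (F) is
    # masked out before comparing the upper byte.
    if (mca_code & 0xEF00) == 0x0100:
        result["cache"] = {
            "request": REQUEST.get((mca_code >> 4) & 0xF, "reserved"),
            "transaction": TRANSACTION.get((mca_code >> 2) & 0x3, "reserved"),
            "level": LEVEL.get(mca_code & 0x3),
        }
    return result


if __name__ == "__main__":
    decoded = decode_mci_status(0xEE0000000100014A)
    print(decoded["flags"])
    print(decoded["cache"])
```

For the logged value this gives the flags VAL, OVER, UC, MISCV, ADDRV and
PCC, and a Generic / Level-2 / DWR (data write) cache error, which agrees
with the mcelog text. Note that EN (bit 60) is clear in this value.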