linux-kernel - Re: Opteron 6276 Corrected ECC errors

Date:	Tue, 5 Feb 2013 11:34:48 -0500
From:	Michael Madore <michael.madore@...il.com>
To:	linux-kernel@...r.kernel.org
Subject: Re: Opteron 6276 Corrected ECC errors

> On Wed, Jan 30, 2013 at 11:29:47AM -0500, Michael Madore wrote:
>> Supermicro H8QGi-F server board (AMD SR5690/SR5670/SP5100 Chipset)
>> 4 X AMD Opteron 6276 processors
>> 32 X 8GB (256GB) DDR3-1600 ECC Registered memory
>> Debian with kernel 3.2.35-2
>>
>> We have received the following two hardware errors:
>>
>> 9/10/12
>>
>> [591006.120039] [Hardware Error]: CPU:58
>> MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x9842c000000c0176
>> [591006.120048] [Hardware Error]: Combined Unit Error: VB Data/ECC error.
>> [591006.120052] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
>>
>> 1/21/12
>>
>> [549004.336097] [Hardware Error]: CPU:40
>> MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c3444e0001f010b
>> [549004.336111] [Hardware Error]:       MC4_ADDR: 0x000000000000e480
>> [549004.336117] [Hardware Error]: Northbridge Error (node 5): ECC
>> Error in the Probe Filter directory.
>> [549004.336125] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN
>>
>> If I understand correctly, both of these errors represent single bit
>> corrected errors in the CPU cache.
>
> Internal CPU structures, victim buffer the first and the second in the
> probe filter which is part of L3.
>
>> On both occasions the system continued to function normally after the
>> error was reported.
>
> As expected; both are single-bit ECC errors which were corrected and
> system state wasn't influenced.
>
>> Is receiving two such errors (on different CPUs) over such a time span
>> cause for concern?
>
> Not really. I'd say, only if the error rate starts increasing over time
> and the error types keep repeating.
>
>> The end user is concerned there is a serious hardware problem. I'm
>> reluctant to start replacing CPUs, however, without seeing a repeated
>> pattern of errors.
>
> Yes, no need to replace, simply watch the error rates. Maybe check the
> temperature of the CPUs, possibly improve cooling are some of the things
> that come to mind.

Hi Boris,

Thank you for the information.  The system has just received a third error:

[573603.432036] [Hardware Error]: CPU:32 MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9c43ccb0011c017b
[573603.432045] [Hardware Error]:  MC4_ADDR: 0x0000002782598940
[573603.432048] [Hardware Error]: Northbridge Error (node 4): L3 ECC data cache error.
[573603.432054] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: EV
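
For reference, the bracketed flag string can be reproduced from the raw
MCx_STATUS value, which makes it easy to confirm that all three reports
are corrected (CE/CECC) rather than uncorrected errors.  This is only a
minimal sketch: the bit positions below are my assumption of what the
kernel's AMD MCE decoder uses on family 15h, and decode_status is a
hypothetical helper name, not a kernel or tool API.

```python
# Assumed bit layout of MCx_STATUS on AMD family 15h, listed in the order
# the kernel prints the flags: (bit, text if set, text if clear).
FLAGS = [
    (62, "Over", "-"),      # error overflow
    (61, "UE", "CE"),       # uncorrected vs. corrected error
    (59, "MiscV", "-"),     # MISC register valid
    (57, "PCC", "-"),       # processor context corrupt
    (58, "AddrV", "-"),     # ADDR register valid
    (44, "Deferred", "-"),  # deferred error
    (43, "Poison", "-"),    # poisoned data
]


def decode_status(status):
    """Return the flag string for a raw 64-bit status value."""
    parts = [on if (status >> bit) & 1 else off for bit, on, off in FLAGS]
    ecc = (status >> 45) & 0x3  # bits 46:45 taken together
    if ecc:
        parts.append("CECC" if ecc == 2 else "UECC")
    return "|".join(parts)


# Status value from the third error in this thread.
print(decode_status(0x9C43CCB0011C017B))
```

Run against the three status values in this thread, it gives the same
flag strings that the kernel printed.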

This is on a different node than the previous two errors.  And each
node has its own L3, correct?  Would you still advocate watching and
waiting?
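
To keep an eye on whether the errors start clustering on one node, the
Northbridge lines can be tallied per node.  A rough sketch, assuming
the log lines have been unwrapped and saved as text; errors_per_node is
a hypothetical helper and the sample lines are the two Northbridge
reports from this thread:

```python
import re
from collections import Counter

# Matches the node number in lines such as
# "Northbridge Error (node 4): L3 ECC data cache error."
NODE_RE = re.compile(r"Northbridge Error \(node (\d+)\)")


def errors_per_node(log_text):
    """Count Northbridge error reports per node in the given log text."""
    return Counter(int(n) for n in NODE_RE.findall(log_text))


sample = """\
[549004.336117] [Hardware Error]: Northbridge Error (node 5): ECC Error in the Probe Filter directory.
[573603.432048] [Hardware Error]: Northbridge Error (node 4): L3 ECC data cache error.
"""

for node, count in sorted(errors_per_node(sample).items()):
    print(f"node {node}: {count} error(s)")
```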

Thanks,

Mike