Message-ID: <6de39a910607272228o26ab51cbw8aaa7215f5fadb8@mail.gmail.com>
Date: Thu, 27 Jul 2006 22:28:18 -0700
From: "Handle X" <xhandle@...il.com>
To: "Robert Hancock" <hancockr@...w.ca>
Cc: "Vikas Kedia" <kedia.vikas@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: Can we ignore errors in mcelog if the server is running fine
On 7/27/06, Robert Hancock <hancockr@...w.ca> wrote:
> Vikas Kedia wrote:
> > The server seems to be running fine. A. can I ignore the following
> > mcelog errors ? B. If not what should i do to stop the server from
> > reporting mcelog errors.
>
> Looks like data cache ECC errors, meaning the CPU 0 is faulty.
> Eventually if it's not replaced there will likely be some uncorrectable
> errors and the system will likely crash.
I am seeing similar, though not identical, errors.
[root@...yxsrv ~]# mcelog
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 89a560bb249
ADDR 1dfa49690
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 2021
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9410c00020080a13 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC a6550f2d4de
ADDR 1de74b670
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 2021
bit32 = err cpu0
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9410c00120080813 MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC afe4eba238a
ADDR 1d8049698
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 2021
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9410c00020080a13 MCGSTATUS 0
MCE 3
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC cc945738d0a
ADDR 194c4b670
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 2021
bit40 = error found by scrub
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9410c10020080a13 MCGSTATUS 0
These repeat whenever I do any kind of operation.
How severe are Chipkill errors? Should I consider throwing away CPU 1
and getting another one?
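As a rough cross-check, the flag lines in the mcelog output above can be read
directly from the STATUS values. This is a minimal Python sketch that assumes
only the three bit labels mcelog printed here (bits 32, 40 and 46). The
function name and lookup table are illustrative and are not part of mcelog.

```python
# Bit labels copied from the mcelog output above; nothing else is assumed
# about the layout of the STATUS register.
FLAGS = {
    32: "err cpu0",
    40: "error found by scrub",
    46: "corrected ecc error",
}


def decode_status(status_hex):
    """Return the labels of the known flag bits set in a STATUS value."""
    status = int(status_hex, 16)
    return [label for bit, label in sorted(FLAGS.items()) if (status >> bit) & 1]


if __name__ == "__main__":
    # The three distinct STATUS values reported in MCE 0 through MCE 3.
    for value in ("9410c00020080a13", "9410c00120080813", "9410c10020080a13"):
        print(value, decode_status(value))
```

Every STATUS value in the log has bit 46 set, which matches the
"corrected ecc error" line printed for each event.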
Regards,
Om.