[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <703726.11655.qm@web50115.mail.yahoo.com>
Date: Fri, 19 Jan 2007 11:34:52 -0800 (PST)
From: Doug Thompson <norsk5@...oo.com>
To: Orion Poplawski <orion@...a.nwra.com>, linux-kernel@...r.kernel.org
Subject: Re: EDAC chipkill messages
--- Orion Poplawski <orion@...a.nwra.com> wrote:
> Robert Hancock wrote:
> > Orion Poplawski wrote:
> >> Can someone please explain to me what these mean?
> >>
> >> EDAC k8 MC1: general bus error: participating processor(local node
>
> >> origin), time-out(no timeout) memory transaction type(generic
> read),
> >> mem or i/o(mem access), cache level(generic)
> >> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4,
> row
> >> 1, channel 0, label "": k8_edac
> >> EDAC MC1: CE - no information available: k8_edac Error Overflow
> set
> >> EDAC k8 MC1: extended error code: ECC chipkill x4 error
> >>
> >> Thanks!
> >>
> >
> > Sounds like you're having some memory ECC errors.. some Memtest86,
> etc.
> > runs may be in order. You may be able to figure out from this info
> what
> > DIMM is having the problem.
> >
>
> That was my assumption as well, but was hoping someone could decode
> the
> above information and point me to the problem chip. I ran Memtest86
> overnight but found no problems, but don't know if it needs to run in
> a
> particular ECC mode.
>
> This is a dual proc 275 system with 4 1GB DIMMs. Guessing that MC1
> is
> the controller on the second CPU. Would row 1 be the second DIMM?
No that would be the FIRST DIMM, on Channel 0
Each DIMM has 2 ChipSelect Rows (CSROW)
Each csrow covers two channels across, therefore on a 4 socket memory
array, there are CSROWS 0 and 1 on the first DIMM row and CSROWS 2 and
3 on the second DIMM row.
WWWWWWWWW XXXXXXXXXXX
YYYYYYYYY ZZZZZZZZZZZ
The W and the Y DIMMs are channel 0
The X and the Z DIMMs are channel 1
csrows 0 and 1 would cross over Y and Z DIMMs
csrows 2 and 3 would cross over W and X DIMMs
The mapping problem occurs in then identifying each of the above goes
to which silk screen labeled sockets on the mobo.
Usually they are labeled:
H0_DIMM2A H0_DIMM2B
H0_DIMM1A H0_DIMM1B
where A is channel 0 and B is channel 1 and
the "DIMM1" would indicate the CSROWs 0 and 1
and "DIMM2" would indicate the CSROWs 2 and 3
The string 'label ""' can be filled in by a userspace script to
properly identify the DIMM silk screen according to the motherboard
used.
The lines with "EDAC MC1:" are EDAC CORE output messages, while the
"EDAC K8:" lines are EDAC Memory Controller driver messages.
"CE" is correctable error
MC1 is memory controller 1 (0 based)
ECC ChipKill x4 was what found the error and corrected it.
The FRU, (field replaceable unit) is the DIMM located at socket
H1_DIMM1A, according to the labeling I mentioned above.
caveat: the detector is not 100% perfect but gives a general area to
look at, the DIMM specification. Sometimes other errors can cause what
looks like a memory error, but usually a bad memory DIMM is the root
cause of the vast majority of such errors.
In addition, memtest86+ doesn't find all the bad memory in all cases,
but it is still a VERY useful tool
doug thompson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists