Message-ID: <AANLkTinu1rzaHY4udBxVWRTCHB0XDicF5tz6FFnsUhfD@mail.gmail.com>
Date: Wed, 30 Jun 2010 08:47:36 -0600
From: Jeffrey Merkey <jeffmerkey@...il.com>
To: Borislav Petkov <bp@...en8.de>,
Jeffrey Merkey <jeffmerkey@...il.com>,
linux-kernel@...r.kernel.org
Subject: Re: 2.6.34 Northbridge Chipset Errors on HP Proliant 4 x Opteron in
x86_64 mode
On Wed, Jun 30, 2010 at 12:38 AM, Borislav Petkov <bp@...en8.de> wrote:
> From: Jeffrey Merkey <jeffmerkey@...il.com>
> Date: Tue, Jun 29, 2010 at 03:13:03PM -0600
>
>> On a 4 x Opteron HP Proliant Server with a CCISS array controller in
>> x86_64 mode, under very heavy (saturated) disk IO, 2.6.34 reports the
>> following error:
>>
>> Jun 29 02:02:08 kernel: Northbridge Error, node 0, core: 0
>> Jun 29 02:02:08 kernel: ECC/ChipKill ECC error.
>> Jun 29 02:02:08 kernel: EDAC amd64 MC0: CE ERROR_ADDRESS= 0xc7358280
>> Jun 29 02:02:08 kernel: EDAC amd64: get_channel_from_ecc_syndrome:
>> error reading F3x180.
>> Jun 29 02:02:08 kernel: EDAC MC0: CE page 0xc7358, offset 0x280,
>> grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac
>> Jun 29 02:03:21 kernel: Northbridge Error, node 0
>> Jun 29 02:03:21 kernel: ECC/ChipKill ECC error.
>> Jun 29 02:03:21 kernel: EDAC amd64 MC0: CE ERROR_ADDRESS= 0xc7358280
>> Jun 29 02:03:21 kernel: EDAC amd64: get_channel_from_ecc_syndrome:
>> error reading F3x180.
>
> It looks like you don't have extended PCI config space accesses enabled
> on that machine. Can you send me the whole dmesg?
>
>> Jun 29 02:03:21 kernel: EDAC MC0: CE page 0xc7358, offset 0x280,
>> grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac
>>
>> The error is reproducible by subjecting the server to excessive disk
>> loads > 350 MB/s stream to disk.
>
> DRAM ECC errors. It looks most probably like the first DIMM on node 0,
> whichever that is, might be slowly failing.
>
> Pinpointing it is not that straightforward, here's what you can do:
>
> Try to figure which it is by looking at the silkscreen labels on the
> motherboard. They're normally named like "DIMM_Ax" where x is in (1,
> 2, ...) or "DIMM_Bx" or a similar scheme. If the layout on the mobo is
> sane, I'm guessing the first DIMM in that naming scheme should be it.
> Try swapping it out to see if the errors disappear.
>
> --
> Regards/Gruss,
> Boris.
>
This makes sense. I replaced the DIMM modules in this unit because
one of them had failed. It looks like its twin is slowly failing as
well. I have a spare and will replace it today and see if the error
persists.
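
To check whether the error persists after the swap, the corrected-error
lines can be tallied per memory controller, row and channel. The following
is only a rough sketch in Python, not part of the EDAC tooling: the helper
names (parse_ce, count_ce) are made up for illustration, the regex is
written against the "EDAC MC0: CE page ..." line format quoted above
(unwrapped to a single line), and the 4 KiB page size is an assumption. It
also rebuilds the physical address from the page and offset fields, which
should match the ERROR_ADDRESS line.

```python
import re
from collections import Counter

# Matches "EDAC MCn: CE page ..., offset ..., row R, channel C" lines as they
# appear in the log excerpt quoted above. It deliberately does not match the
# "EDAC amd64 MCn: CE ERROR_ADDRESS=" lines.
CE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+),.*?"
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_ce(line):
    """Return a dict describing one corrected-error line, or None."""
    m = CE_RE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "row": int(m.group("row")),
        "channel": int(m.group("channel")),
        # physical address = page number shifted by the page size, plus offset
        "address": (page << PAGE_SHIFT) | offset,
    }


def count_ce(lines):
    """Count corrected errors per (mc, row, channel)."""
    counts = Counter()
    for line in lines:
        event = parse_ce(line)
        if event is not None:
            counts[(event["mc"], event["row"], event["channel"])] += 1
    return counts


sample = ('Jun 29 02:02:08 kernel: EDAC MC0: CE page 0xc7358, offset 0x280, '
          'grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac')
event = parse_ce(sample)
print(hex(event["address"]))  # prints 0xc7358280
```

Feeding the lines of the kernel log to count_ce() before and after the
swap should show whether the count for MC0, row 3, channel 0 keeps
growing under the same disk load.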
Thanks
Jeff