linux-kernel - Re: 2.6.34 Northbridge Chipset Errors on HP Proliant 4 x Opteron in x86_64 mode

Message-ID: <AANLkTinu1rzaHY4udBxVWRTCHB0XDicF5tz6FFnsUhfD@mail.gmail.com>
Date:	Wed, 30 Jun 2010 08:47:36 -0600
From:	Jeffrey Merkey <jeffmerkey@...il.com>
To:	Borislav Petkov <bp@...en8.de>,
	Jeffrey Merkey <jeffmerkey@...il.com>,
	linux-kernel@...r.kernel.org
Subject: Re: 2.6.34 Northbridge Chipset Errors on HP Proliant 4 x Opteron in 
	x86_64 mode

On Wed, Jun 30, 2010 at 12:38 AM, Borislav Petkov <bp@...en8.de> wrote:
> From: Jeffrey Merkey <jeffmerkey@...il.com>
> Date: Tue, Jun 29, 2010 at 03:13:03PM -0600
>
>> On a 4 x Opteron HP Proliant Server with a CCISS array controller in
>> x86_64 mode, under very heavy (saturated) disk IO, 2.6.34 reports the
>> following error:
>>
>> Jun 29 02:02:08  kernel: Northbridge Error, node 0, core: 0
>> Jun 29 02:02:08  kernel: ECC/ChipKill ECC error.
>> Jun 29 02:02:08  kernel: EDAC amd64 MC0: CE ERROR_ADDRESS= 0xc7358280
>> Jun 29 02:02:08  kernel: EDAC amd64: get_channel_from_ecc_syndrome:
>> error reading F3x180.
>> Jun 29 02:02:08  kernel: EDAC MC0: CE page 0xc7358, offset 0x280,
>> grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac
>> Jun 29 02:03:21  kernel: Northbridge Error, node 0
>> Jun 29 02:03:21  kernel: ECC/ChipKill ECC error.
>> Jun 29 02:03:21  kernel: EDAC amd64 MC0: CE ERROR_ADDRESS= 0xc7358280
>> Jun 29 02:03:21  kernel: EDAC amd64: get_channel_from_ecc_syndrome:
>> error reading F3x180.
>
> It looks like you don't have extended PCI config space accesses enabled
> on that machine. Can you send me the whole dmesg?
>
>> Jun 29 02:03:21  kernel: EDAC MC0: CE page 0xc7358, offset 0x280,
>> grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac
>>
>> The error is reproducible by subjecting the server to excessive disk
>> loads (> 350 MB/s streamed to disk).
>
> DRAM ECC errors. It looks most probably like the first DIMM on node 0,
> whichever that is, might be slowly failing.
>
> Pinpointing it is not that straightforward, but here's what you can do:
>
> Try to figure which it is by looking at the silkscreen labels on the
> motherboard. They're normally named like "DIMM_Ax" where x is in (1,
> 2, ...) or "DIMM_Bx" or a similar scheme. If the layout on the mobo is
> sane, I'm guessing the first DIMM in that naming scheme should be it.
> Try swapping it out to see if the errors disappear.
>
> --
> Regards/Gruss,
>    Boris.
>
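
The EDAC lines quoted above give the same location twice: once as ERROR_ADDRESS 0xc7358280 and once as page 0xc7358, offset 0x280. A minimal sketch of how the two relate, assuming 4 KiB pages (the usual x86_64 base page size); the helper name is made up for illustration and is not part of the kernel or any EDAC tool:

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, so the low 12 bits are the offset


def split_error_address(addr: int) -> tuple[int, int]:
    """Split a physical address into (page frame number, offset in page)."""
    page = addr >> PAGE_SHIFT
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return page, offset


# ERROR_ADDRESS value taken from the log quoted in this message.
page, offset = split_error_address(0xC7358280)
print(hex(page), hex(offset))  # prints: 0xc7358 0x280
```

Both reports point at the same physical address, which is consistent with a single weak location being hit repeatedly rather than errors scattered across memory.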

This makes sense.  I replaced the DIMM modules in this unit because
one of them had failed.  Looks like its twin is slowly failing as
well.  I have a spare and will replace it today and see if the error
persists.

Thanks

Jeff
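
To check whether the errors persist after the swap, one option is to tally the corrected-error (CE) lines per memory controller, row and channel before and after replacing the DIMM. The following is only a sketch: the regular expression assumes the 2.6.34 message layout exactly as quoted in this thread (other kernel versions print these lines differently), and the function name and sample list are hypothetical.

```python
import re
from collections import Counter

# Assumed layout, copied from the "EDAC MC0: CE page ..." lines in this thread.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-f]+), "
    r"offset (?P<offset>0x[0-9a-f]+), grain \d+, "
    r"syndrome (?P<syndrome>0x[0-9a-f]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def tally_corrected_errors(lines):
    """Count CE reports keyed by (memory controller, row, channel)."""
    counts = Counter()
    for line in lines:
        m = CE_LINE.search(line)
        if m:
            counts[(int(m["mc"]), int(m["row"]), int(m["channel"]))] += 1
    return counts


# Sample lines from the log above, with the mail client's line wrap undone.
SAMPLE = [
    'Jun 29 02:02:08  kernel: EDAC MC0: CE page 0xc7358, offset 0x280, '
    'grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac',
    "Jun 29 02:02:08  kernel: ECC/ChipKill ECC error.",
    'Jun 29 02:03:21  kernel: EDAC MC0: CE page 0xc7358, offset 0x280, '
    'grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac',
]

for (mc, row, channel), n in tally_corrected_errors(SAMPLE).items():
    print(f"MC{mc} row {row} channel {channel}: {n} corrected error(s)")
```

Running the same tally over the log collected under the same disk load after the replacement shows whether the MC0, row 3, channel 0 count keeps growing.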