Message-ID: <20100630063844.GB27891@liondog.tnic>
Date:	Wed, 30 Jun 2010 08:38:44 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	Jeffrey Merkey <jeffmerkey@...il.com>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: 2.6.34 Northbridge Chipset Errors on HP Proliant 4 x Opteron
 in x86_64 mode

From: Jeffrey Merkey <jeffmerkey@...il.com>
Date: Tue, Jun 29, 2010 at 03:13:03PM -0600

> On a 4 x Opteron HP Proliant Server with a CCISS array controller in
> x86_64 mode, under very heavy (saturated) disk IO, 2.6.34 reports the
> following error:
> 
> Jun 29 02:02:08  kernel: Northbridge Error, node 0, core: 0
> Jun 29 02:02:08  kernel: ECC/ChipKill ECC error.
> Jun 29 02:02:08  kernel: EDAC amd64 MC0: CE ERROR_ADDRESS= 0xc7358280
> Jun 29 02:02:08  kernel: EDAC amd64: get_channel_from_ecc_syndrome:
> error reading F3x180.
> Jun 29 02:02:08  kernel: EDAC MC0: CE page 0xc7358, offset 0x280,
> grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac
> Jun 29 02:03:21  kernel: Northbridge Error, node 0
> Jun 29 02:03:21  kernel: ECC/ChipKill ECC error.
> Jun 29 02:03:21  kernel: EDAC amd64 MC0: CE ERROR_ADDRESS= 0xc7358280
> Jun 29 02:03:21  kernel: EDAC amd64: get_channel_from_ecc_syndrome:
> error reading F3x180.

F3x180 sits at offset 0x180, past the first 256 bytes of PCI config
space, so it looks like you don't have extended PCI config space
accesses enabled on that machine. Can you send me the whole dmesg?

> Jun 29 02:03:21  kernel: EDAC MC0: CE page 0xc7358, offset 0x280,
> grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac
> 
> The error is reproduceable by subjecting the server to excessive disk
> loads > 350 MB/S stream to disk.

These are DRAM ECC errors. Most probably the first DIMM on node 0,
whichever that is, is slowly failing.
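
For reference, the page and offset in the "CE page ..." lines are just
the two halves of ERROR_ADDRESS. A minimal Python sketch, assuming 4 KiB
pages (the helper name split_error_address is made up for illustration
and is not part of any EDAC tool):

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, as on a stock x86_64 kernel


def split_error_address(addr, page_shift=PAGE_SHIFT):
    """Split a physical error address into (page frame number, offset)."""
    page = addr >> page_shift                 # upper bits: page frame number
    offset = addr & ((1 << page_shift) - 1)   # lower bits: offset in the page
    return page, offset


page, offset = split_error_address(0xc7358280)
print(hex(page), hex(offset))  # -> 0xc7358 0x280
```

That matches the "page 0xc7358, offset 0x280" in your log, so both
events hit the same physical location.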

Pinpointing it is not that straightforward; here's what you can do:

Try to figure out which one it is by looking at the silkscreen labels on
the motherboard. They're normally named like "DIMM_Ax", where x is in
(1, 2, ...), or "DIMM_Bx", or a similar scheme. If the layout on the mobo
is sane, I'm guessing the first DIMM in that naming scheme should be it.
Try swapping it out to see if the errors disappear.
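
To compare before and after the swap, you could tally the correctable
errors per (MC, row, channel) from the kernel log. A rough Python
sketch, assuming each "EDAC MCn: CE page ..." message is on a single
unwrapped line in the format quoted above (the function name
tally_ce_events is hypothetical):

```python
import re
from collections import Counter

# Matches the "EDAC MC0: CE page ..." summary line as quoted in this thread.
# Assumption: the line is not wrapped the way it is in the mail quote.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain \d+, "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def tally_ce_events(log_text):
    """Count correctable-error lines per (mc, row, channel)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CE_LINE.search(line)
        if m:
            key = (int(m["mc"]), int(m["row"]), int(m["channel"]))
            counts[key] += 1
    return counts
```

Feed it the log contents as a string, e.g. tally_ce_events(open(path).read())
for whatever file your syslog writes kernel messages to. If the count
for (0, 3, 0) stops growing after the swap, the DIMM you pulled was the
culprit.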

-- 
Regards/Gruss,
    Boris.