lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20080215145007.GA18341@janus>
Date:	Fri, 15 Feb 2008 15:50:07 +0100
From:	Frank van Maarseveen <frankvm@...nkvm.com>
To:	Alan Cox <alan@...rguk.ukuu.org.uk>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: Machine check exception with a kernel dependency

On Fri, Feb 15, 2008 at 01:22:41PM +0000, Alan Cox wrote:
> On Wed, 13 Feb 2008 17:25:28 +0100
> Frank van Maarseveen <frankvm@...nkvm.com> wrote:
> 
> > On at least two Dell optiplex 755 systems with a Core 2 Duo I get
> > 
> > Feb 13 15:14:01 inari CPU 1: Machine Check Exception: 0000000000000004 
> > Feb 13 15:14:01 inari CPU 0: Machine Check Exception: 0000000000000005 
> > Feb 13 15:14:01 inari Bank 0: b200004000000800 
> > Feb 13 15:14:01 inari Bank 5: b200221024080400 
> > 
> > 2.6.22.10 shows the problem, 2.6.24.2 ditto but I'm unable to reproduce
> > it with 2.6.24-rc8. BIOS upgrade didn't help. Removing all PCI[e] cards
> > didn't help either.
> 
> If you run the MCE numbers through a decoder what do you get back ?

I've some trouble decoding these in a convincing way. mcelog --core2
--ascii reports "MCG status:RIPV MCIP" for 0000000000000005 and "MCG
status:MCIP" for 0000000000000004.

I've collected several Bank # output lines:

#  text
---------------------------
26 Bank 0: b200004000000800
10 Bank 5: b200121014040400
 8 Bank 5: b200121020080400
 4 Bank 5: b200221010040400
 4 Bank 5: b200221024080400

but mcelog expects lines of the format

	CPU %u: Machine Check Exception: %16Lx Bank %d: %016Lx

(they got broken by netconsole) so I made these up:

CPU 1: Machine Check Exception: 0000000000000004 Bank 0: b200004000000800
CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200121014040400
CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200121020080400
CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200221010040400
CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200221024080400

result:

CPU 1: Machine Check Exception: 0000000000000004 Bank 0: b200004000000800
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 0 MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Originated-request Generic Memory-access Request-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout)
STATUS b200004000000800 MCGSTATUS 4

CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200121014040400
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 5 MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200121014040400 MCGSTATUS 5

CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200121020080400
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 5 MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200121020080400 MCGSTATUS 5

CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200221010040400
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 5 MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221010040400 MCGSTATUS 5

CPU 0: Machine Check Exception: 0000000000000005 Bank 5: b200221024080400
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 5 MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221024080400 MCGSTATUS 5


The problem also exists on an entirely different Xeon system with 4 cores:

cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X3210  @ 2.13GHz
stepping        : 11


-- 
Frank
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ