linux-kernel - Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20060914190548.GI4610@chain.digitalkingdom.org>
Date:	Thu, 14 Sep 2006 12:05:48 -0700
From:	Robin Lee Powell <rlpowell@...italkingdom.org>
To:	linux-kernel@...r.kernel.org
Subject: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

On Tue, Sep 12, 2006 at 03:32:58PM -0700,  wrote:
> 
> Please cc me on replies, as I'm not on the list.

Nevermind; I'm watching the thread at
http://lkml.org/lkml/2006/9/12/300

> I have some moderately old hosts that hang on boot, very early on,
> with any kernel newer than 2.6.3.  Important basic facts about the
> box are dual opteron 244s, 16gb of RAM, and it's a 64-bit build of
> the kernel.

This isn't just me.  All the Debian kernels hang too.  I've tried
all of the following:

Linux version 2.6.8-12-amd64-generic (buildd@...ter) (gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)) #1 Mon Jul 17 01:12:05 UTC 2006

Linux version 2.6.8-12-amd64-k8 (buildd@...ter) (gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)) #1 Mon Jul 17 01:39:03 UTC 2006

Linux version 2.6.8-12-amd64-k8-smp (buildd@...ter) (gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)) #1 SMP Mon Jul 17 00:17:20 UTC 2006

The boot messages differ only in trivial ways, and they all end
with:

- ----------

NET: Registered protocol family 16
CPU 0: Machine Check Exception:                7 Bank 3: b40000000000083b
RIP 10:<ffffffff8023a44c> {pci_conf1_read+0xac/0xe0}
TSC d189cea ADDR fdfc000cfe
CPU 0: Machine Check Exception:                7 Bank 0: b40000000000083b
RIP 10:<ffffffff8023a44c> {pci_conf1_read+0xac/0xe0}
TSC 0
Kernel panic: Uncorrected machine check


- ----------

This happens on 4 different machine!  The only way to get them to
boot to a kernel later than 2.6.3 is "nomce".

I assert that an MCE on multiple machines that have been running
successfully on 2.6.2 for *years* is bad kernel behaviour; we have
no reason to believe that 4 different machines that happen to have
identical hardware have all developed the same hardware failure at
the same time.

Is there anything I can do here besides nomce?

Boot messages on Debian kernels from the primary host:

    amd64 generic:
    http://teddyb.org/~rlpowell/media/regular/lkml/amd64-generic-boot.txt

    amd64 k8:
    http://teddyb.org/~rlpowell/media/regular/lkml/amd64-k8-boot.txt

    amd64 k8 smp:
    http://teddyb.org/~rlpowell/media/regular/lkml/amd64-k8-smp-boot.txt

Boot messages from two other hosts booting Debian's 2.6.8-11 amd64
generic kernel; these hosts are used for massive report jobs in
production on a 2.6.2 kernel, so we *know* they work:

    http://teddyb.org/~rlpowell/media/regular/lkml/prodbing3-boot.txt

    http://teddyb.org/~rlpowell/media/regular/lkml/prodbing4-boot.txt

-Robin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/