linux-kernel - Re: Hangs and reboots under high loads, oops with DEBUG

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <46B195EF.7090409@fsn.hu>
Date:	Thu, 02 Aug 2007 10:29:35 +0200
From:	Attila Nagy <bra@....hu>
To:	Roger Heflin <rheflin@...pa.com>
CC:	Alan Cox <alan@...rguk.ukuu.org.uk>, linux-kernel@...r.kernel.org
Subject: Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ

On 2007.08.01. 0:08, Roger Heflin wrote:
> Attila Nagy wrote:
>> HARDWARE ERROR
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> CPU 1 BANK 0 TSC 1167e915e93ce
>> MCG status:RIPV MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: Internal Timer error
>> STATUS b200004010000400 MCGSTATUS 5
>> This is not a software problem!
>> Run through mcelog --ascii to decode and contact your hardware vendor
>>
>> HARDWARE ERROR
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> CPU 1 BANK 5 TSC 1167e915e9ea8
>> MCG status:RIPV MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: Internal Timer error
>> STATUS b200221024080400 MCGSTATUS 5
>> This is not a software problem!
>> Run through mcelog --ascii to decode and contact your hardware vendor
>
> Attila,
>
> We had some issues with very similar boards all of the problems
> seem to be around the PCIX bus area of the machine, setting the
> PCIX buses to 66 mhz in the bios made things stable (but slow).   Not 
> using
> the PCIX bus also seemed to make things work.   We got MCE's and
> other odd crashes under heavy IO loads.   I believe turning things
> down to 100mhz made things more stable, but things still crashed.
>
> Supermicro reported being able to fix the issue with:
> setting the PCI Configuration -> PCI-e I/O performance
> setting to Colasce 128B.
>
> I am not exactly sure where to set it as we did not try it
> as we had already changed to a different motherboard that did not
> have the issue.
>
> If this works please tell me.
Roger, you are my hero. :)
With that PCI-e setting (again, for the record, this is on a Supermicro 
X7DBE motherboard,
and the BIOS setting is PCIe I/O performance, which has two states: 
Coalesce and Payload 256B)
all of the four machines have survived a half day of continous bashing. 
Previously one, or two
machines typically fell off after such amount of IO load, so it looks 
promising so far.
I hope this won't change over the time.

BTW, this is still with 2.6.21.5, because the SCSI target stuff I use 
(SCST) has some
-I hope temporary- problems with changed (deleted) interfaces in newer 
kernels.

Should the DEBUG_SHIRQ problem in e1000 affect stability (or performance)?

Thanks,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/