Message-ID: <5141FCE4.8080107@geizhals.at>
Date: Thu, 14 Mar 2013 17:37:56 +0100
From: Johannes Truschnigg <johannes.truschnigg@...zhals.at>
To: linux-kernel@...r.kernel.org
Subject: Curious (non-trivial?) stability problem with two identical systems
Hi LKML folks,
I have a curious problem with a Tyan S8228-based, two-node system (data
sheet: http://www.tyan.com/datasheets/YR190-B8228_DataSheet.pdf or
http://www.tyan.com/product_SKU_spec.aspx?ProductType=MB&pid=678&SKU=600000191),
and I hope that this list can help me determine the root of the problem,
if not work around or even fix it...
We have two Opteron 4272 with 32GB DDR3 ECC UDIMM in each of those
boards, and their configs are 100% identical - same BIOS release, same
CMOS settings, same BMC firmware release - the only thing that
differs from one machine to the other is the IP address of the IPMI mgmt
interface. Now, when we first got the system, I put it through our usual
testing routine, which includes a load test over the weekend, running
prime95 (http://mersenne.org/freesoft/) in torture test mode. That's
usually enough to rule out any kind of CPU, memory or thermal problems.
When torture-testing the new setup under Debian Squeeze (with the 3.2.0
kernel from bpo, backported from Wheezy) we quickly found out that one
node kept failing (that is, became unresponsive to any kind of user
input) after a few minutes into the test, while the other node was happy
to run flawlessly for several days, with prime95 not reporting any
errors.
We gradually swapped all components from one node to the other to
identify which part was at fault - in the end, when we had swapped all
components except for the motherboard and chassis (so even PSUs and the
harddisks with their OS installations on them, etc.), the initially
faulty node still kept failing, which told us that, probably, the board
had some kind of defect. No problem, we got a replacement from the
manufacturer three days ago, and went on to put that new machine through
the same testing routine. Thing is - it keeps failing in the exact
same way as before, despite both chassis and mainboard having been
replaced by new hardware. The problem is 100% reproducible with
Squeeze's default kernel (2.6.32) and a recent-ish test-build of 3.8.0
that I did. I also checked whether selecting a different clocksource for
the system would make any kind of difference, but no matter if I use tsc,
hpet or acpi_pm (there are no other alternatives in the kernel image I
tested that with), the failure occurs at most 20 minutes into
the test. Setting the "performance" instead of "ondemand" cpufreq
governor doesn't help either. System temperatures are fine.
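
For anyone wanting to repeat the clocksource and governor variations described above, here is a minimal Python sketch that assumes the standard sysfs layout (/sys/devices/system/clocksource/clocksource0 and /sys/devices/system/cpu/cpuN/cpufreq). The helper names are hypothetical and not taken from this mail, and this is not necessarily how the settings were changed on the machines in question.

```python
from pathlib import Path

# Standard sysfs locations (assumed; adjust if your system differs).
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")
CPU_DIR = Path("/sys/devices/system/cpu")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as reported by sysfs."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def set_clocksource(name, base=CLOCKSOURCE_DIR):
    """Switch the clocksource at runtime (needs root on a real system)."""
    _, available = read_clocksources(base)
    if name not in available:
        raise ValueError(f"{name!r} not offered by this kernel: {available}")
    (base / "current_clocksource").write_text(name)


def set_governor(name, cpu_dir=CPU_DIR):
    """Write the cpufreq governor for every CPU that exposes cpufreq.

    Returns the names of the CPU directories that were changed.
    """
    changed = []
    for gov in sorted(cpu_dir.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        gov.write_text(name)
        changed.append(gov.parent.parent.name)
    return changed
```

Run as root, set_clocksource("hpet") followed by set_governor("performance") would correspond to one of the combinations tried here; both functions accept an alternative base directory so they can be exercised against a scratch directory first.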
When failing, the machine doesn't outright die, but reports CPU
soft-lockups on the console - networking goes down, the USB ports die,
and there's no way to interact with the machine any more apart from a
hard reboot. I've logged the console output using the serial console
support in the kernel (that does work even after we trigger the error,
though a shell on the serial console won't accept any more input once
that happens), and attached it for your viewing pleasure. Maybe you
have an idea what we can try to determine what's the all-important
difference between the sibling systems, or what we could do to get the
other, second node to work?
If you need me to provide any other info than what's already in this
mail, please let me know. Please keep me CC'd, as I'm a subscriber of
this list at present.
Thanks a bunch for reading, have a nice day :)
--
Kind regards
Johannes Truschnigg
Senior System Administrator
--
mailto:johannes.truschnigg@...zhals.at (in urgent cases, please write to
info@...zhals.at)
Geizhals(R) - Preisvergleich Internet Services AG
Obere Donaustrasse 63/2
A-1020 Wien
Tel: +43 1 5811609/87
Fax: +43 1 5811609/55
http://geizhals.at => price comparison for Austria
http://geizhals.de => price comparison for Germany
http://geizhals.eu => price comparison EU-wide
Commercial Court of Vienna | FN 197241K | Registered office: Vienna
View attachment "ser.out.3.8.0" of type "text/plain" (97404 bytes)
View attachment "ser.out.3.2.0" of type "text/plain" (57881 bytes)