Message-ID: <2c0942db0710311428i7675a4b6saf3f79dc60a4f0be@mail.gmail.com>
Date: Wed, 31 Oct 2007 14:28:11 -0700
From: "Ray Lee" <ray-lk@...rabbit.org>
To: "John Sigler" <linux.kernel@...e.fr>
Cc: linux-kernel@...r.kernel.org, linux-pci@...ey.karlin.mff.cuni.cz,
greg@...ah.com, grundler@...isc-linux.org
Subject: Re: How to debug complete kernel lock-ups
On 10/31/07, John Sigler <linux.kernel@...e.fr> wrote:
> "It seems that the PCI clock on this system has a rather large over- and
> undershoot and we suspect that the undershoot (of ~1V) is causing a drop
> in the core voltage of the on-board FPGA which results in lockup of the
> firmware. Both the under- and overshoot are well outside the allowed
> ranges (high=VCC+0.5V and low=-0.5V) of the PCI specification and a
> premature conclusion might be that the system does not comply with the PCI
> spec and that this is the cause of the lockup on this PC."
>
> This is waaay out of my league, as my area is software.
>
> Is it typical for voltage issues to hang hardware?
Yes, if the voltage is applied (or lacking) at the right place.
> Is it typical for one PCI board locking up to nail the entire system?
This doesn't appear to be a case of the *board* crashing, but rather
the board taking the PCI bus and related on-motherboard hardware down
with it. Once that's down, anything that goes through the bus (on a
PC, that's pretty much everything) is inaccessible.
> I don't understand why the lockup would only happen when I write to the
> 4 ports within a small time frame, and not when I only write to 2 ports
> (either one port on each card, or 2 ports on the same card). I suspected
> some kind of concurrency issue...
No, given the hardware guy's description, it's a power issue. Perhaps
writing to a port makes the card draw more power? Four active ports
would then mean roughly four times that extra draw. When the current
load increases, the voltage drops, and if you underpower a chip, it's
going to lose its little head.
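Ray's load argument can be written as a rough IR-drop model. This is a simplification, not something stated in the thread: it assumes each port being written adds a roughly constant current I_port on top of an idle current I_idle, and that the supply path to the FPGA core has some effective resistance R. None of these are measured values.

```latex
V_{\text{core}} \approx V_{\text{supply}} - \left(I_{\text{idle}} + n\,I_{\text{port}}\right) R
```

Under this model, going from n = 2 to n = 4 active ports doubles the port-dependent term n I_port R, which is consistent with two ports working and four locking up.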
> I suppose the next logical step is to get the board's engineers
> and the system's engineers to duke it out? :-)
Yes, all signs point to it being a pure hardware issue. You may be
able to work around it in software by using a counting semaphore
initialized to 2 to cap concurrency, so that you never write to more
than 2 ports at a time until the hardware guys figure it out.
Ray
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/