linux-kernel - Re: How to debug complete kernel lock-ups

Message-ID: <2c0942db0710311428i7675a4b6saf3f79dc60a4f0be@mail.gmail.com>
Date:	Wed, 31 Oct 2007 14:28:11 -0700
From:	"Ray Lee" <ray-lk@...rabbit.org>
To:	"John Sigler" <linux.kernel@...e.fr>
Cc:	linux-kernel@...r.kernel.org, linux-pci@...ey.karlin.mff.cuni.cz,
	greg@...ah.com, grundler@...isc-linux.org
Subject: Re: How to debug complete kernel lock-ups

On 10/31/07, John Sigler <linux.kernel@...e.fr> wrote:
> "It seems that the PCI clock on this system has a rather large over- and
> undershoot and we suspect that the undershoot (of ~1V) is causing a drop
> in the core voltage of the on-board FPGA which results in lockup of the
> firmware. Both the under- and overshoot are well outside the allowed
> ranges (high=VCC+0.5V and low=-0.5V) of the PCI specification and a
> premature conclusion might be that the system does not comply to the PCI
> spec and that this is the cause of the lockup on this PC."
>
> This is waaay out of my league, as my area is software.
>
> Is it typical for voltage issues to hang hardware?

Yes, if the voltage is applied (or lacking) at the right place.

> Is it typical for one PCI board locking up to nail the entire system?

This doesn't appear to be a case of just the *board* crashing, but
rather the board taking the PCI bus and the related hardware on the
motherboard down with it. Once the bus is down, anything you need that
goes through it (on a PC, that's pretty much everything) is inaccessible.

> I don't understand why the lockup would only happen when I write to the
> 4 ports within a small time frame, and not when I only write to 2 ports
> (either one port on each card, or 2 ports on the same card). I suspected
> some kind of concurrency issue...

No, given the hardware guy's description, it's a power issue. Perhaps
when you're writing to a port, you're using more power on the card?
Four ports = 4 * the power draw. When the current load increases,
voltage drops, and if you underpower a chip, it's going to lose its
little head.
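
A rough way to write down that reasoning, as a sketch only: the symbols
below are illustrative and not from the hardware report, and the linear
scaling with the number of active ports n is an assumption taken from
the "four ports = 4 * the draw" guess above.

```latex
V_{\text{chip}} = V_{\text{supply}} - I_{\text{load}} \, R_{\text{path}},
\qquad
I_{\text{load}} \approx n \, I_{\text{port}}
```

Under those assumptions, going from n = 2 to n = 4 doubles the drop
across the supply path, which would fit two ports being fine and four
ports pushing the chip below its minimum core voltage.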

> I suppose the next logical step is to get the board's engineers
> and the system's engineers duke it out? :-)

Yes, all signs point to it being a pure hardware issue. You may be
able to work around it in software by guarding the port writes with a
counting semaphore initialized to 2, so that you never write to more
than 2 ports at a time until the hardware guys figure it out.
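
To make the counting-semaphore idea concrete, here is a minimal
user-space sketch in Python. It is not driver code: the names
write_port, write_all_ports and do_write are hypothetical, and the
sleep is only a stand-in for the real port write. A kernel driver would
presumably use the kernel's own semaphore primitives instead.

```python
import threading
import time

MAX_CONCURRENT_WRITES = 2  # the limit suggested above

# Counting semaphore: at most two holders at any one time.
write_slots = threading.BoundedSemaphore(MAX_CONCURRENT_WRITES)

# Bookkeeping only, so the limit can be observed from outside.
_stats_lock = threading.Lock()
active_writes = 0
peak_active_writes = 0


def write_port(port, data, do_write=None):
    """Write to one port while holding one of the two slots."""
    global active_writes, peak_active_writes
    with write_slots:  # blocks while two writes are already in flight
        with _stats_lock:
            active_writes += 1
            peak_active_writes = max(peak_active_writes, active_writes)
        try:
            if do_write is not None:
                do_write(port, data)  # hypothetical hardware access
            else:
                time.sleep(0.01)      # stand-in for the real write
        finally:
            with _stats_lock:
                active_writes -= 1


def write_all_ports(ports, data):
    """Start one writer thread per port and wait for all of them."""
    threads = [threading.Thread(target=write_port, args=(p, data))
               for p in ports]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak_active_writes
```

Calling write_all_ports([0, 1, 2, 3], b"x") still starts four writers,
but the third and fourth wait until one of the first two releases its
slot, so no more than two writes are ever in flight together.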

Ray