linux-kernel - Re: Unhandled IRQs on AMD E-450

Message-ID: <4EE0B156.4080708@ladisch.de>
Date:	Thu, 08 Dec 2011 13:45:10 +0100
From:	Clemens Ladisch <clemens@...isch.de>
To:	Jeroen Van den Keybus <jeroen.vandenkeybus@...il.com>
CC:	"Huang, Shane" <Shane.Huang@....com>,
	Borislav Petkov <bp@...64.org>,
	"Nguyen, Dong" <Dong.Nguyen@....com>, linux-kernel@...r.kernel.org,
	linux1394-devel@...ts.sourceforge.net
Subject: Re: Unhandled IRQs on AMD E-450

Jeroen Van den Keybus wrote:
> I have the impression that I see the same failure mechanism for both
> IRQs. All goes well for a while, until an IRQ storm starts right
> (e1000: 19 us, firewire-ohci: 39 us) after a valid IRQ.
>
> Therefore there is a strong correlation between the arrival of the
> spurious interrupt, allegedly caused by a mystery device, and the
> earlier arrival of a valid interrupt for a device. Combined with the
> fact that it happens on 2 different IRQs, this pretty much rules out
> the possibility for me that there is either a mystery device at all, or
> that the existing devices would both be defective, does it not?

There appears to be a problem with the interrupt handling.

In PCI, interrupts are level-triggered, which means that the interrupt
line (INTx) is active when it's at level 0 and inactive when it's at
level 1.  When a device wants to trigger an interrupt, it outputs zero
on its interrupt output.  The level doesn't get reset to 1 until the
driver acknowledges the interrupt (in e1000, read of the ICR; in
firewire-ohci, write of IntEventClear).  As long as the line stays at 0,
all interrupt handlers will continue being called.  This mechanism
allows multiple devices to share one interrupt line.
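
As a rough illustration of the sharing mechanism described above, here is
a toy Python model (not kernel code).  The class and method names are
made up for this sketch; only the device names come from this thread.
The line is at 0 while any device has an unacknowledged interrupt, and
every registered handler is called until the line returns to 1.

```python
class SharedIntxLine:
    """Toy model of one active-low, level-triggered PCI interrupt line."""

    def __init__(self):
        self.pending = set()   # devices currently pulling the line to 0
        self.handlers = {}     # device name -> handler function

    def register(self, name):
        # The handler acknowledges the device if it has an interrupt
        # pending and reports whether the interrupt was its own.
        def handler():
            if name in self.pending:
                self.pending.discard(name)  # the ack releases the line
                return True
            return False
        self.handlers[name] = handler

    def raise_irq(self, name):
        self.pending.add(name)

    def level(self):
        # 0 = active (at least one device asserts), 1 = inactive
        return 0 if self.pending else 1

    def dispatch(self, max_rounds=10):
        """Call every handler while the line stays at 0; return the log."""
        log = []
        rounds = 0
        while self.level() == 0 and rounds < max_rounds:
            for name, handler in self.handlers.items():
                log.append((name, handler()))
            rounds += 1
        return log


line = SharedIntxLine()
line.register("e1000")
line.register("firewire-ohci")
line.raise_irq("firewire-ohci")
log = line.dispatch()
print(log)           # which handlers ran, and whether each claimed the IRQ
print(line.level())  # back to 1 once the owning driver has acknowledged
```

Both handlers run even though only one device raised the interrupt; the
handler that does not own it simply reports "not mine".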

In PCI Express, there are only one-to-one connections, and there are no
separate interrupt lines.  A device raises an interrupt by sending
an interrupt message, which could be understood as a memory write to
a special address at the interrupt controller.  Nothing needs to be done
to deactivate the interrupt; if the device has another reason for
an interrupt, it just sends another interrupt message.

When a PCI device is connected to a PCI Express system, the old INTx
interrupt line must be converted to PCI Express messages.  This is done
with _two_ special messages, Assert_INTx and Deassert_INTx.  The first
tells the interrupt controller that some INTx line went from 1 to 0, the
second tells it that it went from 0 back to 1; this allows the interrupt
controller to implement the level-triggered behaviour.
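
The conversion can be sketched in a few lines of Python.  This is only a
simplified model under the assumption that the bridge sends exactly one
message per level change; the function names are invented for
illustration and do not correspond to any real driver or hardware
interface.

```python
ASSERT, DEASSERT = "Assert_INTx", "Deassert_INTx"


def line_to_messages(levels, initial=1):
    """Bridge side: emit one message per transition of the INTx level."""
    messages = []
    previous = initial
    for level in levels:
        if previous == 1 and level == 0:
            messages.append(ASSERT)      # line went from 1 to 0
        elif previous == 0 and level == 1:
            messages.append(DEASSERT)    # line went from 0 back to 1
        previous = level
    return messages


def messages_to_level(messages, initial=1):
    """Interrupt controller side: rebuild the virtual line level."""
    level = initial
    for message in messages:
        level = 0 if message == ASSERT else 1
    return level


samples = [1, 0, 0, 1, 1, 0, 1]          # sampled INTx levels over time
msgs = line_to_messages(samples)
print(msgs)
print(messages_to_level(msgs))           # matches the last sampled level
```

As long as every message arrives, the controller's virtual level always
agrees with the real line behind the bridge.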

It appears that some Deassert_INTx messages get lost on your system.
There are no indications of any other missing PCIe packets, so this
looks like a problem with the interrupt handling in your PCI/PCIe
bridge, the ASM1083 chip.

> I also do not understand, if there would be a stuck IRQ line, why I
> can unload and reload e1000 and firewire-ohci without immediately
> getting the same IRQ storm.

Linux will reenable the interrupt line when a new driver attaches to it.
At this point, it's still stuck, but the device initialization will
trigger some actual interrupts, and after the first assert/deassert
pair, the line will be unstuck.
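
To make that sequence concrete, here is a small Python sketch of the
interrupt controller's view of the line.  It is a toy model with
hypothetical names, assuming the only fault is a single dropped
Deassert_INTx; it does not reproduce the kernel's actual logic for
detecting and disabling a storming IRQ.

```python
class VirtualIntx:
    """Controller-side view of an INTx line carried over PCIe messages."""

    def __init__(self):
        self.level = 1        # 1 = inactive, 0 = active
        self.enabled = True   # whether the kernel has the IRQ enabled

    def receive(self, message):
        if message == "Assert_INTx":
            self.level = 0
        elif message == "Deassert_INTx":
            self.level = 1

    def storming(self):
        # Handlers keep being called while the line is enabled and active.
        return self.enabled and self.level == 0


irq = VirtualIntx()

# A valid interrupt: the assert arrives and the driver acknowledges the
# device, but the matching Deassert_INTx never reaches the controller.
irq.receive("Assert_INTx")
stuck_after_loss = irq.storming()

# The kernel gives up on the unhandled interrupts and disables the line.
irq.enabled = False

# A driver is reloaded: the line is re-enabled but still looks asserted.
irq.enabled = True
stuck_after_reload = irq.storming()

# Device initialization raises a real interrupt; this time both messages
# get through, so the controller's view returns to inactive.
irq.receive("Assert_INTx")
irq.receive("Deassert_INTx")
recovered = not irq.storming()

print(stuck_after_loss, stuck_after_reload, recovered)
```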

Regards,
Clemens