linux-kernel - BUG: "do_IRQ: 0.39 No irq handler for vector" from a 16550 port

Message-ID: <87k1lvg2z4.fsf@gmail.com>
Date:   Fri, 02 Nov 2018 11:58:55 +0100
From:   Holger Schurig <holgerschurig@...il.com>
To:     linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, x86@...nel.org
Subject: BUG: "do_IRQ: 0.39 No irq handler for vector" from a 16550 port

Hi all,

I have a weird bug on systems that use the Haswell architecture and have
"real" serial ports (/dev/ttyS*).

Hardware: an embedded device with an "Intel(R) Celeron(R) 2980U @
1.60GHz"; I tried microcode 0x23 and 0x24. It also happens on an HP
Elite 840 G1. Both have the Haswell architecture.

I can plug a different CPU module into the embedded device, which then
has an "Intel(R) Atom(TM) CPU N455 @ 1.66GHz", obviously not Haswell.
With the identical kernel, I don't get the error there.

Kernel: happens with distro kernels (Debian, Ubuntu, Fedora). The common
factor seems to be that the kernels are >= 4.9.x. It also happens with
upstream stable kernels: I tried 4.13.x, 4.14.x and 4.18.x, up to
4.18.16.

The embedded device also behaves strangely in other ways (e.g. I once
had MCEs with a 32bit kernel, which went away when using a 64bit
kernel). We also sometimes get an error in AUFS with the same timestamp
as the do_IRQ message. I don't understand what AUFS has to do with
hardware interrupts. However, I don't want to concentrate on this yet; I
think that a strange message in a mainline kernel is in itself worth
tracking down. If some interrupt goes haywire, there is certainly a
chance that some memory gets corrupted. Then again, this might be
something totally different, because the HP Elite doesn't show this.

So, let's return to the more reproducible "do_IRQ: 0.39 No irq handler
for vector".

I'm happy that I found a way to reproduce it: the message triggers when
I close the serial port. printk's indicate that the error happens after
the IER is cleared, and even after synchronize_irq() in
serial8250_do_shutdown().

Sometimes even a "stty </dev/ttyS1" is enough, because it already
opens/closes the port. But it happens only sometimes.

A better way is to use a tool called "stress-ng", which has various
stressors. Some newer versions (e.g. the one in Debian, 0.07.16-1) just
open all files in /dev, run an fstat() on them, and close them again,
all in a very fast loop. This has the side effect that /dev/ttyS* are
opened and closed very quickly, which triggers the error message
easily:

[    6.558244] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[   17.048154] fuse init (API version 7.27)
[   17.248215] do_IRQ: 0.39 No irq handler for vector
[   17.249622] do_IRQ: 0.39 No irq handler for vector
[   17.252415] do_IRQ: 0.39 No irq handler for vector
[   17.253698] do_IRQ: 0.39 No irq handler for vector
[   18.528774] do_IRQ: 0.39 No irq handler for vector
[   18.532305] do_IRQ: 0.39 No irq handler for vector
[   18.532540] do_IRQ: 0.39 No irq handler for vector
[   18.606916] do_IRQ: 0.39 No irq handler for vector
[   20.227241] random: crng init done

Here I ran stress-ng for just a few seconds. Unfortunately, from time to
time the exact same setup makes the error scarce, e.g. it can happen
that we don't see the error for 15 minutes.
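
For a single port, the open/fstat()/close cycle described above could be
sketched roughly as follows. This is not what stress-ng itself does
internally; the function name open_close_cycles and the open flags
(O_RDONLY, O_NONBLOCK, O_NOCTTY) are assumptions chosen for this
illustration.

```python
import os


def open_close_cycles(path, iterations):
    """Open, fstat() and close `path` repeatedly.

    Returns the number of cycles that completed. Hypothetical helper,
    not taken from stress-ng.
    """
    completed = 0
    # O_NONBLOCK: do not block in open() waiting for carrier on a serial line.
    # O_NOCTTY: do not let the tty become our controlling terminal.
    flags = os.O_RDONLY | os.O_NONBLOCK | os.O_NOCTTY
    for _ in range(iterations):
        try:
            fd = os.open(path, flags)
        except OSError:
            # E.g. the node is missing or we lack permission; skip this cycle.
            continue
        try:
            os.fstat(fd)
        finally:
            # Closing the port is the step after which the message shows up.
            os.close(fd)
        completed += 1
    return completed
```

Calling open_close_cycles("/dev/ttyS1", 10000) as a user with access to
the port and watching dmesg in parallel would be one way to check
whether a single port is enough to trigger the message.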

So when running this for a night, I had between 1500 and 30000 of these
messages in my dmesg/journal.
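
To get such counts from a saved log, something like the following sketch
could be used. It assumes the two numbers in "do_IRQ: 0.39" are the CPU
number and the vector number; the name count_unhandled_vectors is made
up for this example.

```python
import re
from collections import Counter

# Matches e.g. "[   17.248215] do_IRQ: 0.39 No irq handler for vector".
# Assumption: the first number is the CPU, the second the vector.
PATTERN = re.compile(r"do_IRQ: (\d+)\.(\d+) No irq handler for vector")


def count_unhandled_vectors(log_text):
    """Count the do_IRQ messages in `log_text`, keyed by (cpu, vector)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts
```

Feeding it the saved output of dmesg or journalctl -k would show whether
it is always the same CPU and vector that are affected.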

One thing that I noticed is that "noapic=1" makes the error go away.

Using the Atom CPU with the older architecture also makes the error go
away, but that one is now EOL. :-(

Any advice on how to proceed further?

Greetings,
Holger