Message-ID: <Pine.LNX.4.64.0701231132090.22957@Phoenix.oltrelinux.com>
Date: Tue, 23 Jan 2007 11:35:57 +0100 (CET)
From: l.genoni@...relinux.com
To: linux-kernel@...r.kernel.org
Subject: Re: System crash after "No irq handler for vector" linux 2.6.19
On Mon, 22 Jan 2007, Eric W. Biederman wrote:
> "Luigi Genoni" <luigi.genoni@...elli.com> writes:
>
>> Hi,
>> last night a Linux server with 8 dual-core Opteron 2600MHz CPUs crashed just
>> after giving this message
>>
>> Jan 22 04:48:28 frey kernel: do_IRQ: 1.98 No irq handler for vector
>
> Ok.  This indicates that the hardware is doing something we didn't expect.
> We don't know which irq the hardware was trying to deliver when it
> sent vector 0x98 to cpu 1.
>
>> I have no other logs, and I eventually lost the OOPS since I have no
>> netconsole set up.
>
> If you had an oops it may have meant the above message was a secondary
> symptom. Groan. If it stayed up long enough to give an OOPS then
> there is a chance the above message appearing only once had nothing
> to do with the actual crash.
>
>
> How long had the system been up?
Sorry, my English is bad, so I could not really express what I wanted to say.
I didn't get an OOPS: I could not see one on the console (nor in the logs),
and I do not have netconsole.
The system had been up and running for 52 days.
>
>> As I said, the system is running Linux 2.6.19 compiled with gcc 4.1.1 for
>> AMD Opteron (see the attached .config), with no kernel preemption except
>> BKL preemption. glibc 2.4.
>>
>> The system has 16 GB RAM and 8 dual-core Opteron 2600MHz CPUs.
>>
>> I am running irqbalance 0.55.
>>
>> any hints on what has happened?
>
> Three guesses.
>
> - A race triggered by irq migration (but I would expect more people to be
>   yelling).  The code path where that message comes from is new in 2.6.19,
>   so it may not have had all of the bugs found yet :(
> - A weird hardware or BIOS setup.
> - A secondary symptom triggered by some other bug.
>
> If this winds up being reproducible we should be able to track it down.
> If not, this may end up in the files of "crap, something bad happened that
> we don't understand".
>
> The one condition I know how to test for (if you are willing) is an
> irq migration race, simply by triggering irq migration much more often
> and thus increasing our chances of hitting a problem.
OK, I will try tomorrow morning (I am at home right now, and it is night in Italy).
>
> Stopping irqbalance and running something like:
> for irq in 0 24 28 29 44 45 60 68 ; do
>     while :; do
>         for mask in 1 2 4 8 10 20 40 80 100 200 400 800 1000 2000 4000 8000 ; do
>             echo $mask > /proc/irq/$irq/smp_affinity
>             sleep 1
>         done
>     done &
> done
>
> That should force every irq to migrate once a second, and removing the sleep 1
> is even harsher, although we max out at one irq migration per irq received.
>
> If some variation of the above loop does not trigger the do_IRQ "No irq
> handler for vector" message, chances are it isn't a race in irq migration.
>
> If we can rule out the race scenario it will at least put us in the right
> direction for guessing what went wrong with your box.
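For reference, before starting that loop I plan to stop irqbalance and save the
current affinities, roughly like this (the init script name and the log path are
assumptions; they may differ on this distribution):

# stop the balancer so it does not fight with the test loop
/etc/init.d/irqbalance stop

# save the current affinity of every irq in the test so it can be restored later
for irq in 0 24 28 29 44 45 60 68 ; do
    cat /proc/irq/$irq/smp_affinity > /tmp/smp_affinity.$irq.save
done

# watch for the message while the migration loop is running
tail -f /var/log/messages | grep "No irq handler for vector" &

Restoring afterwards is then just a matter of echoing the saved masks back and
restarting irqbalance.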
btw, the server is running Tibco (bwengine, imse and adr3; all Tibco software
is multithreaded), which uses multicast and stresses the network cards
continuously, sending and receiving multicast and UDP packets while talking to
a remote Oracle DB on eth2 (gigabit ethernet). Tibco continuously creates and
deletes small files (often less than 1 KB, thousands per minute when it has a
lot of work to do) on three SAN LUNs (IBM DS8000) of 33 GB each, accessed
through two 2Gbit fibre channel paths (Linux dm-multipath) with LVM2 volumes
(reiserfs, since it gives us the best performance with all those small files,
and has never given us trouble).
In reality the system sees 21 LUNs on two paths, and those LUNs are also seen
by 4 other Linux servers identical to this one, clustered together with this
system using ServiceGuard (which is userspace-only cluster software, no kernel
modules; eth0 and eth4 are the heartbeat links and eth6 is the card on the
public LAN).
The Volume Group accessed by this server is reserved using LVM2 tags.
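(For completeness, the tag reservation is just something visible with the
standard LVM2 reporting tools, e.g. something like the command below; the
vg_tags output field is an assumption about the LVM2 version installed here:)

# show each volume group together with its tags
vgs -o vg_name,vg_tags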
Tibco also does normal I/O (about 2 MB/s) on a TCP NFSv3-mounted volume using
eth2. We have never had trouble with this NFS mount.
Client nfs v3 (after 10 hours of uptime):
null          getattr       setattr       lookup        access        readlink
0        0%   2770320 78%   122      0%   546401  15%   19672    0%   133      0%
read          write         create        mkdir         symlink       mknod
40673    1%   3338     0%   58       0%   0        0%   0        0%   0        0%
remove        rmdir         rename        link          readdir       readdirplus
32       0%   0        0%   0        0%   0        0%   144213   4%   552      0%
fsstat        fsinfo        pathconf      commit
316      0%   872      0%   0        0%   16       0%
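(Those are just the client call counters as nfsstat reports them; the raw
counters are also readable directly, e.g.:)

# client-side NFS statistics and the raw counters behind them
nfsstat -c
cat /proc/net/rpc/nfs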
So the network cards and the SAN cards do work a lot. I have included the boot
messages; maybe they help you figure out how this server is configured.
What more... ah, usually the system is not doing much on Sunday night (not
working at all, really), and it crashed on a Sunday night. Looking at sar I
see some strange statistics...
sar -I SUM:
              INTR        intr/s
04:40:01      sum         839.40
05:00:01      sum          62.10   <---
05:10:01      sum         245.63

sar -a:
              CPU     %user     %nice   %system   %iowait     %idle
04:40:01      all      1.14      0.00      0.44      0.06     98.36
05:00:01      all    287.15    288.31    287.73    288.17      0.00   <---
05:10:01      all      0.00      0.00      0.00      0.01     99.98
(but normally during the day:
10:30:01      all     88.41      0.00      0.81      0.05     10.73)

sar -W:
              cswch/s
04:40:01      7738.69
05:00:01      478986467937.98   <---
05:10:01      194.36
(but normally during the day:
10:20:01      61264.85)

sar -b:
              tps      rtps      wtps   bread/s   bwrtn/s
04:40:01   187.05      0.02    187.03      0.27   1777.81
05:00:01   106.24    111.47    106.29    108.40     59.48   <---
05:10:01     2.48      0.02      2.47      0.27     28.40

sar -c:
              proc/s
04:40:01      0.34
05:00:01      478986468210.56   <--- this is absolutely abnormal
05:10:01      0.03
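If it helps, I pulled these numbers from the sysstat daily file for that night,
roughly like this (the file name and path come from the default sysstat cron
setup here, so they may differ):

# interrupt totals around the crash window (sa22 = daily file for Jan 22)
sar -I SUM -f /var/log/sa/sa22 -s 04:30:00 -e 05:20:00
# and the same -f/-s/-e window with the other flags shown above (-W, -b, -c, ...)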
That's all I can say about the race scenario for now. If you need more tests
done, please tell me.
>
> Eric
>
Thanx
Luigi Genoni
[Attachment: "boot.msg" (TEXT/PLAIN, 46905 bytes)]