Message-ID: <Pine.LNX.4.64.0701231132090.22957@Phoenix.oltrelinux.com>
Date: Tue, 23 Jan 2007 11:35:57 +0100 (CET)
From: l.genoni@...relinux.com
To: linux-kernel@...r.kernel.org
Subject: Re: System crash after "No irq handler for vector" linux 2.6.19
On Mon, 22 Jan 2007, Eric W. Biederman wrote:
> "Luigi Genoni" <luigi.genoni@...elli.com> writes:
>
>> Hi,
>> last night a Linux server with 8 dual-core Opteron 2600MHz CPUs crashed just
>> after giving this message
>>
>> Jan 22 04:48:28 frey kernel: do_IRQ: 1.98 No irq handler for vector
>
> Ok.  This indicates that the hardware is doing something we didn't expect.
> We don't know which irq the hardware was trying to deliver when it
> sent vector 0x98 to cpu 1.
>
>> I have no other logs, and I eventually lost the OOPS since I have no
>> netconsole set up.
>
> If you had an oops it may have meant the above message was a secondary
> symptom. Groan. If it stayed up long enough to give an OOPS then
> there is a chance the above message appearing only once had nothing
> to do with the actual crash.
>
>
> How long had the system been up?
Sorry, my English is bad, so I could not really express what I wanted to say.
I didn't get an OOPS: I could not see one on the console (nor in the logs),
and I do not have netconsole.
The system had been up and running for 52 days.
>
>> As I said, the system is running Linux 2.6.19 compiled with gcc 4.1.1 for
>> AMD Opteron (see the attached .config), with no kernel preemption except
>> BKL preemption. glibc 2.4.
>>
>> The system has 16 GB RAM and 8 dual-core Opteron 2600MHz CPUs.
>>
>> I am running irqbalance 0.55.
>>
>> any hints on what has happened?
>
> Three guesses.
>
> - A race triggered by irq migration (but I would expect more people to be
>   yelling).  The code path where that message comes from is new in 2.6.19,
>   so it may not have had all of the bugs found yet :(
> - A weird hardware or BIOS setup.
> - A secondary symptom triggered by some other bug.
>
> If this winds up being reproducible we should be able to track it down.
> If not, this may end up in the files of "crap, something bad happened that
> we don't understand".
>
> The one condition I know how to test for (if you are willing) is an
> irq migration race, simply by triggering irq migration much more often
> and thus increasing our chances of hitting a problem.
OK, I will try tomorrow morning (I am at home right now, and it is night in Italy).
>
> Stopping irqbalance and running something like:
> for irq in 0 24 28 29 44 45 60 68 ; do
>     while :; do
>         for mask in 1 2 4 8 10 20 40 80 100 200 400 800 1000 2000 4000 8000 ; do
>             echo $mask > /proc/irq/$irq/smp_affinity
>             sleep 1
>         done
>     done &
> done
>
> That should force every irq to migrate once a second, and removing the sleep 1
> is even harsher, although we max out at one irq migration per irq received.
>
> If some variation of the above loop does not trigger the do_IRQ "No irq
> handler for vector" message, chances are it isn't a race in irq migration.
>
> If we can rule out the race scenario it will at least put us in the right
> direction for guessing what went wrong with your box.
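For reference, before starting that loop I plan to stop irqbalance and save the
current affinities, roughly like this (the init script name and the log path are
assumptions; they may differ on this distribution):

# stop the balancer so it does not fight with the test loop
/etc/init.d/irqbalance stop

# save the current affinity of every irq in the test so it can be restored later
for irq in 0 24 28 29 44 45 60 68 ; do
    cat /proc/irq/$irq/smp_affinity > /tmp/smp_affinity.$irq.save
done

# watch for the message while the migration loop is running
tail -f /var/log/messages | grep "No irq handler for vector" &

Restoring afterwards is then just a matter of echoing the saved masks back and
restarting irqbalance.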
btw, the server is running Tibco (bwengine, imse and adr3; all Tibco software
is multithreaded), which uses multicast and stresses the network cards
continuously, sending and receiving multicast and UDP packets while talking to
a remote Oracle DB on eth2 (gigabit ethernet). Tibco continuously creates and
deletes small files (often less than 1 KB, thousands per minute when it has a
lot of work to do) on three SAN LUNs (IBM DS8000) of 33 GB each, accessed
through two 2Gbit fibre channel paths (Linux dm-multipath) with LVM2 volumes
(reiserfs, since it gives us the best performance with all those small files,
and has never given us trouble).
In reality the system sees 21 LUNs on two paths, and those LUNs are also seen
by 4 other Linux servers identical to this one, clustered together with this
system using ServiceGuard (which is userspace-only cluster software, no kernel
modules; eth0 and eth4 are the heartbeat links and eth6 is the card on the
public LAN).
The Volume Group accessed by this server is reserved using LVM2 tags.
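(For completeness, the tag reservation is just something visible with the
standard LVM2 reporting tools, e.g. something like the command below; the
vg_tags output field is an assumption about the LVM2 version installed here:)

# show each volume group together with its tags
vgs -o vg_name,vg_tags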
Tibco also does normal I/O (about 2 MB/s) on a TCP NFSv3-mounted volume using
eth2. We have never had trouble with this NFS mount.
Client nfs v3 (after 10 hours of uptime):
null          getattr       setattr       lookup        access        readlink
0        0%   2770320 78%   122      0%   546401  15%   19672    0%   133      0%
read          write         create        mkdir         symlink       mknod
40673    1%   3338     0%   58       0%   0        0%   0        0%   0        0%
remove        rmdir         rename        link          readdir       readdirplus
32       0%   0        0%   0        0%   0        0%   144213   4%   552      0%
fsstat        fsinfo        pathconf      commit
316      0%   872      0%   0        0%   16       0%
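(Those are just the client call counters as nfsstat reports them; the raw
counters are also readable directly, e.g.:)

# client-side NFS statistics and the raw counters behind them
nfsstat -c
cat /proc/net/rpc/nfs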
So the network cards and the SAN cards do work a lot. I have included the boot
messages; maybe they help you figure out how this server is configured.
What more... ah, usually the system is not doing much on Sunday night (not
working at all, really), and it crashed on a Sunday night. Looking at sar I
see some strange statistics...
sar -I SUM:
              INTR        intr/s
04:40:01      sum         839.40
05:00:01      sum          62.10   <---
05:10:01      sum         245.63

sar -a:
              CPU     %user     %nice   %system   %iowait     %idle
04:40:01      all      1.14      0.00      0.44      0.06     98.36
05:00:01      all    287.15    288.31    287.73    288.17      0.00   <---
05:10:01      all      0.00      0.00      0.00      0.01     99.98
(but normally during the day:
10:30:01      all     88.41      0.00      0.81      0.05     10.73)

sar -W:
              cswch/s
04:40:01      7738.69
05:00:01      478986467937.98   <---
05:10:01      194.36
(but normally during the day:
10:20:01      61264.85)

sar -b:
              tps      rtps      wtps   bread/s   bwrtn/s
04:40:01   187.05      0.02    187.03      0.27   1777.81
05:00:01   106.24    111.47    106.29    108.40     59.48   <---
05:10:01     2.48      0.02      2.47      0.27     28.40

sar -c:
              proc/s
04:40:01      0.34
05:00:01      478986468210.56   <--- this is absolutely abnormal
05:10:01      0.03
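If it helps, I pulled these numbers from the sysstat daily file for that night,
roughly like this (the file name and path come from the default sysstat cron
setup here, so they may differ):

# interrupt totals around the crash window (sa22 = daily file for Jan 22)
sar -I SUM -f /var/log/sa/sa22 -s 04:30:00 -e 05:20:00
# and the same -f/-s/-e window with the other flags shown above (-W, -b, -c, ...)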
That's all I can say about the race scenario for now. If you need more tests
done, please tell me.
>
> Eric
>
Thanx
Luigi Genoni
[Attachment: "boot.msg" (TEXT/PLAIN, 46905 bytes)]