lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0704022012300.3963@sheep.housecafe.de>
Date:	Mon, 2 Apr 2007 20:41:00 +0100 (BST)
From:	Christian Kujau <evil@...ouse.de>
To:	linux-kernel@...r.kernel.org
cc:	netdev@...r.kernel.org, malte@...ouse.de
Subject: 2.6.20.4: NETDEV WATCHDOG and lockups


Hi there,

we have serious problems with 2 of our servers: both shiny new amd64 
dual core, with both 2GB RAM, 32bit kernel+userland (Debian/testing).
Both servers have 2 NICs, RTL8139 (eth0, irq10) and RTL8169s
(eth1, irq11).

Both boxes are running fine but after "a while" they lock up and 
eventually restart all of a sudden. The last messages in the logfile 
are:

14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
14:15:14 db2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1

Then the box reboots, nothing else in the log.

As the servers have been set up recently, we only know that it happend 
with Debian's 2.6.17-? kernel. When we upgraded the installation, we 
went to 2.6.18-4-k7 and the problem persistent. We're using now vanilla 
2.6.20.4 and while the problem persists, it takes longer to lockup (~20h 
as opposed to 4-5h). While this is a good thing for us, it's now harder
to reproduce (we have to wait longer).

Searching the archives turned up quite a few results but no real fix and 
lots of old postings too. We then disabled ACPI completely and booted 
with 'noapic'. Now both boxes are running for > 20h and we're curious 
how long they make it. However, booting with 'noapic' slowed down both 
servers *a lot*.

>From /proc/interrupts we can see that only CPU0 (core 0) is handling 
interrupts while CPU1 does not. We compiled with CONFIG_IRQBALANCE=n so 
that irqbalance(1) would work - but to no avail.

Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both 
hosts and feel free to ask for more details. Although both boxes are in 
production we'll be happy test more bootoptions/patches and the like.

TIA,
Christian.
-- 
BOFH excuse #266:

All of the packets are empty.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ