lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 15 Nov 2006 11:01:00 -0800
From:	Ray Lee <ray-lk@...rabbit.org>
To:	Larry Finger <Larry.Finger@...inger.net>,
	Bcm43xx-dev@...ts.berlios.de, LKML <linux-kernel@...r.kernel.org>,
	netdev@...r.kernel.org, John Linville <linville@...driver.com>,
	Michael Buesch <mb@...sch.de>, Andrew Morton <akpm@...l.org>
Subject: bcm43xx regression 2.6.19rc3 -> rc5, rtnl_lock trouble?

Hey all,

I ran 2.6.19-rc3 for almost two weeks or so with no difficulties (none related
to the bcm43xx driver, at least). However, Andrew asked me to double check the
latest release to see if my problem report against 2.6.18 (hard locks) was
fixed. Good news is that it still is fixed. Bad news is that 2.6.19-rc5 is
worse than rc3 in other ways.

I've come back to my laptop being mostly dead after hours of it being off on
its own (twice now). Mostly dead meaning the keyboard is nearly
non-responsive, but the mouse works great (I'm in X, of course). I say 'nearly
dead' as sysrq-t,b works, so I'm sorta stumped there. (x-session seems to use
netlink, so perhaps that's the connection? ctrl-alt-f[1-7] don't do anything,
however.)

It seems to be a locking problem, though lockdep isn't catching it. I'll let
you guys decide though.

Regardless, here's what's I can see. My logs start filling with:

$ grep 'NETDEV WATCHDOG:' /var/log/messages |  cut -d '[' -f 2- | head
50025.388173] NETDEV WATCHDOG: eth1: transmit timed out
50029.019574] NETDEV WATCHDOG: eth1: transmit timed out
50030.835313] NETDEV WATCHDOG: eth1: transmit timed out
50032.651049] NETDEV WATCHDOG: eth1: transmit timed out
50034.466785] NETDEV WATCHDOG: eth1: transmit timed out
50036.282523] NETDEV WATCHDOG: eth1: transmit timed out
50038.098237] NETDEV WATCHDOG: eth1: transmit timed out
50039.913974] NETDEV WATCHDOG: eth1: transmit timed out
50041.729709] NETDEV WATCHDOG: eth1: transmit timed out
50043.545447] NETDEV WATCHDOG: eth1: transmit timed out
(...1249 of these, so it doesn't fix itself.)

and then the system becomes pretty worthless. (Full /var/log/messages with
sysrq-t at: http://madrabbit.org/~ray/messages.gz ).

Interesting bits of that:

$ grep -B5 -A10 'Nov 13 01:5.*mutex' /var/log/messages | cut -d ']' -f2-
 DWARF2 unwinder stuck at child_rip+0xa/0x12

 Leftover inexact backtrace:

  [restore_args+0/48] restore_args+0x0/0x30
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [kthread+0/272] kthread+0x0/0x110
  [child_rip+0/18] child_rip+0x0/0x12

 khelper       S ffff810037fbe318     0     5      1             6     4 (L-TLB)
  ffff810037907e60 0000000000000046 ffff810037907e70 ffff810037fbe140
  ffff81001095f140 0000000000003b5d ffff810001e3e668 0000000000000286
  ffff810037907e40 ffffffff8026bbb2 ffff810037907e70 ffff810001e3e600
 Call Trace:
  [worker_thread+236/352] worker_thread+0xec/0x160
  [kthread+211/272] kthread+0xd3/0x110
--
 DWARF2 unwinder stuck at child_rip+0xa/0x12

 Leftover inexact backtrace:

  [restore_args+0/48] restore_args+0x0/0x30
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [kthread+0/272] kthread+0x0/0x110
  [child_rip+0/18] child_rip+0x0/0x12

 kthread       S ffff810037fad218     0     6      1    25    2129     5 (L-TLB)
  ffff810037f01e60 0000000000000046 ffff810037f01e70 ffff810037fad040
  ffff81002b3df140 000000000000062b ffff810001e3e468 0000000000000286
  ffff810037f01e40 ffffffff8026bbb2 ffff810037f01e70 ffff810001e3e400
 Call Trace:
  [worker_thread+236/352] worker_thread+0xec/0x160
  [kthread+211/272] kthread+0xd3/0x110
--
 DWARF2 unwinder stuck at child_rip+0xa/0x12

 Leftover inexact backtrace:

  [restore_args+0/48] restore_args+0x0/0x30
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [kthread+0/272] kthread+0x0/0x110
  [child_rip+0/18] child_rip+0x0/0x12

 kblockd/0     S ffff810037989318     0    25      6            26       (L-TLB)
  ffff81003798fe60 0000000000000046 ffff81003798fe70 ffff810037989140
  ffff8100379a5100 000000000000078b ffff810037fa2468 0000000000000286
  ffff81003798fe40 ffffffff8026bbb2 ffff81003798fe70 ffff810037fa2400
 Call Trace:
  [worker_thread+236/352] worker_thread+0xec/0x160
  [kthread+211/272] kthread+0xd3/0x110
--
 NetworkManage D ffff810037943258     0  4833      1          4853  4809 (NOTLB)
  ffff81002bfefbe8 0000000000000046 ffff81002bfefb98 ffff810037943080
  ffff81002e6d2100 00000000000122a6 ffffffff8062ce80 0000000000000046
  0000000000000246 ffff810037943080 ffff81002e47b3f0 ffff81002e47b3a0
 Call Trace:
  [__mutex_lock_slowpath+344/624] __mutex_lock_slowpath+0x158/0x270
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [_end+126343345/2126632680] :bcm43xx:bcm43xx_wx_get_mode+0x29/0x60
  [ioctl_standard_call+139/944] ioctl_standard_call+0x8b/0x3b0
  [wireless_process_ioctl+260/976] wireless_process_ioctl+0x104/0x3d0
  [dev_ioctl+854/944] dev_ioctl+0x356/0x3b0
  [sock_ioctl+576/624] sock_ioctl+0x240/0x270
  [do_ioctl+49/160] do_ioctl+0x31/0xa0
  [vfs_ioctl+683/720] vfs_ioctl+0x2ab/0x2d0
  [sys_ioctl+106/160] sys_ioctl+0x6a/0xa0
  [system_call+126/131] system_call+0x7e/0x83
 DWARF2 unwinder stuck at system_call+0x7e/0x83
--
 x-session-man D ffff81002ef02298     0  5625   4565  5672          4586 (NOTLB)
  ffff810028a1fad8 0000000000000046 ffffffff8062c500 ffff81002ef020c0
  ffff8100249a6040 0000000000008c5d 0000000000000000 0000000000000046
  0000000000000246 ffff81002ef020c0 ffffffff805505b0 ffffffff80550560
 Call Trace:
  [__mutex_lock_slowpath+344/624] __mutex_lock_slowpath+0x158/0x270
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [rtnetlink_rcv+44/96] rtnetlink_rcv+0x2c/0x60
  [netlink_data_ready+26/96] netlink_data_ready+0x1a/0x60
  [netlink_sendskb+51/96] netlink_sendskb+0x33/0x60
  [netlink_unicast+536/592] netlink_unicast+0x218/0x250
  [netlink_sendmsg+704/736] netlink_sendmsg+0x2c0/0x2e0
  [sock_sendmsg+271/320] sock_sendmsg+0x10f/0x140
  [sys_sendto+273/320] sys_sendto+0x111/0x140
  [system_call+126/131] system_call+0x7e/0x83
 DWARF2 unwinder stuck at system_call+0x7e/0x83

--
 gweather-appl D ffff810023b43218     0  7198      1          7199  3540 (NOTLB)
  ffff810033315ad8 0000000000000046 0000000000000000 ffff810023b43040
  ffff81001f47a0c0 000000000001215f 0000000000000000 0000000000000046
  0000000000000246 ffff810023b43040 ffffffff805505b0 ffffffff80550560
 Call Trace:
  [__mutex_lock_slowpath+344/624] __mutex_lock_slowpath+0x158/0x270
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [rtnetlink_rcv+44/96] rtnetlink_rcv+0x2c/0x60
  [netlink_data_ready+26/96] netlink_data_ready+0x1a/0x60
  [netlink_sendskb+51/96] netlink_sendskb+0x33/0x60
  [netlink_unicast+536/592] netlink_unicast+0x218/0x250
  [netlink_sendmsg+704/736] netlink_sendmsg+0x2c0/0x2e0
  [sock_sendmsg+271/320] sock_sendmsg+0x10f/0x140
  [sys_sendto+273/320] sys_sendto+0x111/0x140
  [system_call+126/131] system_call+0x7e/0x83
 DWARF2 unwinder stuck at system_call+0x7e/0x83

--
 gweather-appl D ffff81001f47a298     0  7199      1          7200  7198 (NOTLB)
  ffff8100338efad8 0000000000000046 0000000000000000 ffff81001f47a0c0
  ffff81000eee2040 0000000000009ce2 0000000000000000 0000000000000046
  0000000000000246 ffff81001f47a0c0 ffffffff805505b0 ffffffff80550560
 Call Trace:
  [__mutex_lock_slowpath+344/624] __mutex_lock_slowpath+0x158/0x270
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [rtnetlink_rcv+44/96] rtnetlink_rcv+0x2c/0x60
  [netlink_data_ready+26/96] netlink_data_ready+0x1a/0x60
  [netlink_sendskb+51/96] netlink_sendskb+0x33/0x60
  [netlink_unicast+536/592] netlink_unicast+0x218/0x250
  [netlink_sendmsg+704/736] netlink_sendmsg+0x2c0/0x2e0
  [sock_sendmsg+271/320] sock_sendmsg+0x10f/0x140
  [sys_sendto+273/320] sys_sendto+0x111/0x140
  [system_call+126/131] system_call+0x7e/0x83
 DWARF2 unwinder stuck at system_call+0x7e/0x83

--
 gweather-appl D ffff81000eee2218     0  7200      1          9252  7199 (NOTLB)
  ffff810019d3dad8 0000000000000046 0000000000000000 ffff81000eee2040
  ffff81002178d0c0 000000000000cb7b ffffffff8062ce80 0000000000000046
  0000000000000246 ffff81000eee2040 ffffffff805505b0 ffffffff80550560
 Call Trace:
  [__mutex_lock_slowpath+344/624] __mutex_lock_slowpath+0x158/0x270
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [rtnetlink_rcv+44/96] rtnetlink_rcv+0x2c/0x60
  [netlink_data_ready+26/96] netlink_data_ready+0x1a/0x60
  [netlink_sendskb+51/96] netlink_sendskb+0x33/0x60
  [netlink_unicast+536/592] netlink_unicast+0x218/0x250
  [netlink_sendmsg+704/736] netlink_sendmsg+0x2c0/0x2e0
  [sock_sendmsg+271/320] sock_sendmsg+0x10f/0x140
  [sys_sendto+273/320] sys_sendto+0x111/0x140
  [system_call+126/131] system_call+0x7e/0x83
 DWARF2 unwinder stuck at system_call+0x7e/0x83

--
 wpa_supplican D ffff810026e9a218     0  8058      1         19402  6666 (NOTLB)
  ffff81001bf81ad8 0000000000000046 0000000000000002 ffff810026e9a040
  ffff8100072e4140 000000000001b54f 0000000000000000 0000000000000046
  0000000000000246 ffff810026e9a040 ffffffff805505b0 ffffffff80550560
 Call Trace:
  [__mutex_lock_slowpath+344/624] __mutex_lock_slowpath+0x158/0x270
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [rtnetlink_rcv+44/96] rtnetlink_rcv+0x2c/0x60
  [netlink_data_ready+26/96] netlink_data_ready+0x1a/0x60
  [netlink_sendskb+51/96] netlink_sendskb+0x33/0x60
  [netlink_unicast+536/592] netlink_unicast+0x218/0x250
  [netlink_sendmsg+704/736] netlink_sendmsg+0x2c0/0x2e0
  [sock_sendmsg+271/320] sock_sendmsg+0x10f/0x140
  [sys_sendto+273/320] sys_sendto+0x111/0x140
  [system_call+126/131] system_call+0x7e/0x83
 DWARF2 unwinder stuck at system_call+0x7e/0x83

--
 dhclient      D ffff8100353092d8     0  9218   4784                     (NOTLB)
  ffff810007811d28 0000000000000046 00000000000000e2 ffff810035309100
  ffffffff80518520 000000000000920c 0000000000000000 0000000000000046
  0000000000000246 ffff810035309100 ffffffff805505b0 ffffffff80550560
 Call Trace:
  [__mutex_lock_slowpath+344/624] __mutex_lock_slowpath+0x158/0x270
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [rtnl_lock+16/32] rtnl_lock+0x10/0x20
  [dev_ioctl+53/944] dev_ioctl+0x35/0x3b0
  [sock_ioctl+576/624] sock_ioctl+0x240/0x270
  [do_ioctl+49/160] do_ioctl+0x31/0xa0
  [vfs_ioctl+683/720] vfs_ioctl+0x2ab/0x2d0
  [sys_ioctl+106/160] sys_ioctl+0x6a/0xa0
  [system_call+126/131] system_call+0x7e/0x83
 DWARF2 unwinder stuck at system_call+0x7e/0x83

 Leftover inexact backtrace:
--
 ip            D ffff8100072e4318     0  9221   4809                     (NOTLB)
  ffff8100073e5ad8 0000000000000046 ffff810037bcbbf8 ffff8100072e4140
  ffffffff80518520 000000000002322b 0000000000000000 0000000000000046
  0000000000000246 ffff8100072e4140 ffffffff805505b0 ffffffff80550560
 Call Trace:
  [__mutex_lock_slowpath+344/624] __mutex_lock_slowpath+0x158/0x270
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [rtnetlink_rcv+44/96] rtnetlink_rcv+0x2c/0x60
  [netlink_data_ready+26/96] netlink_data_ready+0x1a/0x60
  [netlink_sendskb+51/96] netlink_sendskb+0x33/0x60
  [netlink_unicast+536/592] netlink_unicast+0x218/0x250
  [netlink_sendmsg+704/736] netlink_sendmsg+0x2c0/0x2e0
  [sock_sendmsg+271/320] sock_sendmsg+0x10f/0x140
  [sys_sendto+273/320] sys_sendto+0x111/0x140
  [system_call+126/131] system_call+0x7e/0x83
 DWARF2 unwinder stuck at system_call+0x7e/0x83

[...]
52234.341105] Showing all locks held in the system:
52234.341113]
52234.341114] =============================================
52234.341115]
52234.394782] NETDEV WATCHDOG: eth1: transmit timed out
52236.210364] NETDEV WATCHDOG: eth1: transmit timed out
52238.026024] NETDEV WATCHDOG: eth1: transmit timed out
[...more of these, 'til I reboot with sysrq-b]


Notice that 'all locks held in the system' is empty.


There were three changes between rc3 and rc5:

$ hg log -rv2.6.19-rc3:v2.6.19-rc5 -I drivers/net/wireless/bcm43xx/
changeset:   40500:4ef6746b2f06
user:        Al Viro <viro@....linux.org.uk>
date:        Wed Oct 25 12:01:11 2006 +0700
summary:     [PATCH] missing include of dma-mapping.h

changeset:   40964:ca97546422bd
user:        Michael Buesch <mb@...sch.de>
date:        Wed Nov 01 08:15:40 2006 +0500
summary:     [PATCH] bcm43xx: Fix low-traffic netdev watchdog TX timeouts

changeset:   40965:f5021f3521c2
user:        Larry Finger <Larry.Finger@...inger.net>
date:        Wed Nov 01 08:15:41 2006 +0500
summary:     [PATCH] bcm43xx: fix unexpected LED control values in BCM4303 sprom


Of those, the middle one is the most suspicious.

ray@...enix:~/work/kernel/linux-2.6$ hg log -v -p -r 40964
changeset:   40964:ca97546422bd9a52a7000607d657ca2915f31104
user:        Michael Buesch <mb@...sch.de>
date:        Wed Nov 01 08:15:40 2006 +0500
files:       drivers/net/wireless/bcm43xx/bcm43xx_main.c
description:
[PATCH] bcm43xx: Fix low-traffic netdev watchdog TX timeouts

This fixes a netdev watchdog timeout problem.
The software needs to call netif_tx_disable before running the
hardware calibration code. The problem condition can be shown by the
following timegraph.

|---5secs - ~10 jiffies time---|---|OOPS
^                              ^
last real TX                   periodic work stops netif

At OOPS, the following happens:
The watchdog timer triggers, because the timeout of 5secs
is over. The watchdog first checks for stopped TX.
_Usually_ TX is only stopped from the TX handler to indicate
a full TX queue. But this is different. We need to stop TX here,
regardless of the TX queue state. So the watchdog recognizes
the stopped device and assumes it is stopped due to full
TX queues (Which is a _wrong_ assumption in this case). It then
tests how far the last TX has been in the past. If it's more than
5secs (which is the case for low or no traffic), it will fire
a TX timeout.

Signed-off-by: Michael Buesch <mb@...sch.de>
Signed-off-by: Larry Finger <Larry.Finger@...inger.net>
Signed-off-by: John W. Linville <linville@...driver.com>

committer: John W. Linville <linville@...top.(none)> 1162350940 -0500


diff -r 41ff0150cbadd56e692f148adb1bfd4ca420e3e0 -r
ca97546422bd9a52a7000607d657ca2915f31104
drivers/net/wireless/bcm43xx/bcm43xx_main.c
--- a/drivers/net/wireless/bcm43xx/bcm43xx_main.c       Wed Nov 01 08:15:39
2006 +0500
+++ b/drivers/net/wireless/bcm43xx/bcm43xx_main.c       Wed Nov 01 08:15:40
2006 +0500
@@ -3163,9 +3163,11 @@ static void bcm43xx_periodic_work_handle
 static void bcm43xx_periodic_work_handler(void *d)
 {
        struct bcm43xx_private *bcm = d;
+       struct net_device *net_dev = bcm->net_dev;
        unsigned long flags;
        u32 savedirqs = 0;
        int badness;
+       unsigned long orig_trans_start = 0;

        mutex_lock(&bcm->mutex);
        badness = estimate_periodic_work_badness(bcm->periodic_state);
@@ -3173,7 +3175,18 @@ static void bcm43xx_periodic_work_handle
                /* Periodic work will take a long time, so we want it to
                 * be preemtible.
                 */
-               netif_tx_disable(bcm->net_dev);
+
+               netif_tx_lock_bh(net_dev);
+               /* We must fake a started transmission here, as we are going to
+                * disable TX. If we wouldn't fake a TX, it would be possible to
+                * trigger the netdev watchdog, if the last real TX is already
+                * some time on the past (slightly less than 5secs)
+                */
+               orig_trans_start = net_dev->trans_start;
+               net_dev->trans_start = jiffies;
+               netif_stop_queue(net_dev);
+               netif_tx_unlock_bh(net_dev);
+
                spin_lock_irqsave(&bcm->irq_lock, flags);
                bcm43xx_mac_suspend(bcm);
                if (bcm43xx_using_pio(bcm))
@@ -3198,6 +3211,7 @@ static void bcm43xx_periodic_work_handle
                        bcm43xx_pio_thaw_txqueues(bcm);
                bcm43xx_mac_enable(bcm);
                netif_wake_queue(bcm->net_dev);
+               net_dev->trans_start = orig_trans_start;
        }
        mmiowb();
        spin_unlock_irqrestore(&bcm->irq_lock, flags);

...so, could someone with more of a clue than I review the locking here?

In the meantime, I could try to revert this and run with the changed kernel,
but I've only had this happen twice in the past four days (dunno how to
trigger it other than walking away from the laptop). As rc3 didn't give me any
troubles for the 13 some-odd days I ran it, I think it's safe to say this is a
new behavior with rc5 vs rc3.

Suggestions? Requests for <shudder> even more info?

Ray
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists