Date:	Sat, 17 Dec 2011 19:31:42 -0800
From:	Matt Ginzton <matt@...zton.net>
To:	nic_swsd@...ltek.com, romieu@...zoreil.com
Cc:	netdev@...r.kernel.org
Subject: reproducible kernel freeze using r8169 in Linux 3.0

Hi r8169 maintainers,

I have a completely reproducible kernel freeze that appears to be in the r8169 driver. All I have to do is generate a decent network load in both directions, and the kernel freezes hard: it stops responding to network requests and to input from a local console, including magic sysrq, and needs a hard reboot.

Summary: r8169 from Linux 3.0 freezes reproducibly on my RTL8111/8168B hardware. r8169 from Linux 3.1 is better -- no freeze, but it still emits a kernel warning in my repro scenario.

Details:

Software: Ubuntu 11.10 Server, x86, all updates as of today.
Hardware: Fit-pc2i, specs at http://www.fit-pc.com/web/fit-pc2/fit-pc2i-specifications/
uname -a: Linux flux 3.0.0-14-generic #23-Ubuntu SMP
lspci | grep Realtek: 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet Controller (rev 02)
ethtool -i eth0 says: driver: r8169, version: 2.3LK-NAPI, firmware-version: N/A

How I can reproduce the freeze:

- configure pktgen to generate a lot of TX traffic on eth0. I started with pktgen config in ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/pktgen.conf-1-1, and modified it to use 1500 byte packets. Leaving it at the default 10 million packet count, the test generates 15GB of outgoing packets and takes about 3 minutes to run.
- during those 3 minutes, generate some RX traffic too. On another box, I create a 128 MB file with "dd if=/dev/zero of=/tmp/foo bs=1048576 count=128", then scp that file to the target box. If the copy succeeds, I do it again.
- usually this will reproduce the freeze within a few seconds, and always within a minute or two.
- often but not always, before it freezes, I will see multiple "r8169 0000:02:00.0: eth0: link up" messages (with no intervening "link down"), if I'm watching the kernel log or an appropriately configured console. This is kind of a canary for the impending freeze.
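
For reference, the TX half of the steps above could be scripted roughly as follows. This is only a sketch of the usual pktgen /proc interface, not a copy of the pktgen.conf-1-1 example; the destination IP and MAC address are hypothetical placeholders, and the function names are made up for illustration. It also checks the arithmetic behind the "15GB in about 3 minutes" figure.

```python
# Sketch: build the pktgen control commands for one TX device.
# Nothing is written to /proc unless apply() is called (as root, with the
# pktgen module loaded). Destination IP/MAC below are placeholders.

PKT_COUNT = 10_000_000   # default packet count mentioned in the report
PKT_SIZE = 1500          # bytes per packet, as modified for this test


def pktgen_commands(dev, dst_ip, dst_mac,
                    count=PKT_COUNT, pkt_size=PKT_SIZE, thread="kpktgend_0"):
    """Return (proc_path, command) pairs in the order they should be written."""
    base = "/proc/net/pktgen"
    return [
        (f"{base}/{thread}", "rem_device_all"),     # detach any old devices
        (f"{base}/{thread}", f"add_device {dev}"),  # bind dev to this thread
        (f"{base}/{dev}", f"count {count}"),
        (f"{base}/{dev}", f"pkt_size {pkt_size}"),
        (f"{base}/{dev}", f"dst {dst_ip}"),
        (f"{base}/{dev}", f"dst_mac {dst_mac}"),
        (f"{base}/pgctrl", "start"),                # blocks until the run ends
    ]


def apply(commands):
    """Write each command to its pktgen control file."""
    for path, cmd in commands:
        with open(path, "w") as f:
            f.write(cmd + "\n")


def total_bytes(count=PKT_COUNT, pkt_size=PKT_SIZE):
    """Total payload generated by one run."""
    return count * pkt_size


def average_mbit_per_s(nbytes, seconds):
    """Average rate in Mbit/s for nbytes sent over the given duration."""
    return nbytes * 8 / seconds / 1e6


if __name__ == "__main__":
    for path, cmd in pktgen_commands("eth0", "192.0.2.1", "00:00:00:00:00:02"):
        print(f"echo '{cmd}' > {path}")
    print(total_bytes(), "bytes,",
          round(average_mbit_per_s(total_bytes(), 180)), "Mbit/s average")
```

With 10 million packets of 1500 bytes, total_bytes() gives 15,000,000,000 bytes, and spreading that over 180 seconds averages about 667 Mbit/s, which is consistent with a gigabit link under heavy TX load.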

What happens when it freezes:

- apparently, nothing at all, though the other test box running the client side of the scp will notice the transfer stalling.
- at this point, the target box won't respond at all on the network.
- it also won't respond to keyboard input from a USB keyboard
- including the magic sysrq key, which is configured (i.e. it works before the freeze)
- even if I use sysrq+9 to set console loglevel to 9, nothing is printed to console when it freezes
- upon reboot, there's nothing interesting or recent in /var/log/kern.log
- the "eth0: link up" messages being the exception; they do show up on the console at log level 9, and do show up in kern.log if it was synced before the freeze.
- I tried configuring the NMI watchdog but could not get it to work on this box. So I have no information about where it's freezing.
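
As a side note on "sysrq is configured": one way to confirm which magic sysrq functions are enabled before a test run is to decode /proc/sys/kernel/sysrq. The sketch below assumes the bitmask meanings given in the kernel's sysrq documentation; the function names are hypothetical, and 176 is used only as an example value.

```python
# Sketch: decode the magic sysrq setting from /proc/sys/kernel/sysrq.
# Bit meanings follow the kernel's sysrq documentation; 0 disables sysrq
# entirely and 1 enables every function.

SYSRQ_BITS = {
    2: "console loglevel control",
    4: "keyboard control (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync",
    32: "remount read-only",
    64: "signalling of processes",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value):
    """Return the list of sysrq function groups enabled by this value."""
    if value == 0:
        return []
    if value == 1:
        return list(SYSRQ_BITS.values())
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting from the running kernel."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    print(decode_sysrq(176))  # example value only
```

If "console loglevel control" is missing from the decoded list, sysrq+9 would not be expected to change the loglevel, so it is worth checking before relying on console output during a freeze.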

So what then:

It was the multiple "eth0: link up" messages that drew my attention in the direction of the network. As soon as I started googling for "r8169 freeze", I found all sorts of reports of different problems with r8169, going back years, so it's hard to know what's still relevant now and on my hardware -- people complain about freezes, dropped packets, refusal to autonegotiate media type or to work at rates faster than 10mbit… I didn't have any of these problems except the freeze.

The fix usually suggested for problems with r8169 is to go to Realtek and get r8168 instead. I'm a little leery of this: why is there an apparently open-source driver that is not in-tree and is maintained only by Realtek, yet whose modinfo reports "author: Realtek and the Linux r8168 crew <netdev@...r.kernel.org>"? I also noticed http://packages.debian.org/unstable/main/r8168-dkms, which says "This driver should only be used for devices not yet supported by the in-kernel driver r8169". But hey, I gave it a try. I downloaded version 8.027, compiled it, installed it, and used it. I couldn't get it to crash, but I also couldn't get it to work with vlan virtual interfaces, which I intend to use (and which work fine with r8169), so it's not much good to me. (I gather from, for example, http://patchwork.ozlabs.org/patch/28045/ that this list doesn't maintain r8168, doesn't care about bugs in it, and is annoyed by the author claim, so I'm not coming here to complain about r8168, just to point out that I did try it.)

I was losing hope of getting reliable Linux networking on this fit-pc at all, but then I found https://bugs.launchpad.net/ubuntu/+source/linux-backports-modules-3.0.0/+bug/839393, wherein some enterprising souls have backported the r8169 driver from "Linux 3.1" (I don't know exactly what vintage 3.1) because it's more reliable. So I tried that backported one and it does indeed seem to work better. If I perform the same test as above (pktgen generating TX traffic and scp generating RX traffic), I still see the multiple "eth0: link up" messages, but now the kernel does not freeze, and I get a backtrace (once -- soon after the "link up" thing starts, probably corresponding to when the freeze would have happened).

A run of my test with this driver, including the kernel backtrace, looks something like:

Dec 17 19:20:22 flux kernel: [  200.958872] pktgen: Packet Generator for packet performance testing. Version: 2.74
Dec 17 19:20:34 flux kernel: [  213.029292] SysRq : Changing Loglevel
Dec 17 19:20:34 flux kernel: [  213.029437] Loglevel set to 9
Dec 17 19:20:53 flux kernel: [  232.052203] r8169 0000:02:00.0: eth0: link up
Dec 17 19:20:55 flux kernel: [  233.772238] r8169 0000:02:00.0: eth0: link up
Dec 17 19:20:55 flux kernel: [  233.788188] r8169 0000:02:00.0: eth0: link up
Dec 17 19:20:55 flux kernel: [  234.196614] ------------[ cut here ]------------
Dec 17 19:20:55 flux kernel: [  234.196802] WARNING: at /build/buildd/linux-3.0.0/net/core/dev.c:3809 net_rx_action+0x1f3/0x220()
Dec 17 19:20:55 flux kernel: [  234.197053] Hardware name: CM-iAM/SBC-FITPC2i
Dec 17 19:20:55 flux kernel: [  234.197181] Modules linked in: pktgen act_police cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_time xt_connlimit xt_realm xt_addrtype iptable_raw xt_comment xt_recent xt_policy ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp xt_TPROXY nf_tproxy_core ip6_tables nf_defrag_ipv6 xt_tcpmss xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_conntrack xt_connmark xt_CLASSIFY xt
Dec 17 19:20:55 flux kernel: _AUDIT ipt_LOG xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables 8021q garp stp vesafb i2c_isch sch_gpio psb_gfx(C) snd_hda_codec_realtek psmouse drm_kms_helper snd_hda_intel serio_raw snd_hda_codec drm snd_hwdep lpc_sch snd_pcm snd_timer i2c_algo_bit snd soundcore snd_page_alloc poulsbo video lp parport usbhid hid pata_sch sdhci_pci sdhci r8169
Dec 17 19:20:55 flux kernel: [  234.202347] Pid: 1550, comm: kpktgend_0 Tainted: G         C  3.0.0-14-generic #23-Ubuntu
Dec 17 19:20:55 flux kernel: [  234.202576] Call Trace:
Dec 17 19:20:55 flux kernel: [  234.202664]  [<c151a3f2>] ? printk+0x2d/0x2f
Dec 17 19:20:55 flux kernel: [  234.202804]  [<c1047a22>] warn_slowpath_common+0x72/0xa0
Dec 17 19:20:55 flux kernel: [  234.202962]  [<c143ba03>] ? net_rx_action+0x1f3/0x220
Dec 17 19:20:55 flux kernel: [  234.203112]  [<c143ba03>] ? net_rx_action+0x1f3/0x220
Dec 17 19:20:55 flux kernel: [  234.203263]  [<c1047a72>] warn_slowpath_null+0x22/0x30
Dec 17 19:20:55 flux kernel: [  234.203417]  [<c143ba03>] net_rx_action+0x1f3/0x220
Dec 17 19:20:55 flux kernel: [  234.203567]  [<c104e570>] ? local_bh_enable_ip+0x90/0x90
Dec 17 19:20:55 flux kernel: [  234.203725]  [<c104e5f1>] __do_softirq+0x81/0x1a0
Dec 17 19:20:55 flux kernel: [  234.203868]  [<c104e570>] ? local_bh_enable_ip+0x90/0x90
Dec 17 19:20:55 flux kernel: [  234.204041]  <IRQ>  [<c104e559>] ? local_bh_enable_ip+0x79/0x90
Dec 17 19:20:55 flux kernel: [  234.204244]  [<c152dbd6>] ? _raw_spin_unlock_bh+0x16/0x20
Dec 17 19:20:55 flux kernel: [  234.204411]  [<f98430a4>] ? pktgen_xmit+0x144/0x270 [pktgen]
Dec 17 19:20:55 flux kernel: [  234.204623]  [<f80479c0>] ? rtl8169_close+0x210/0x210 [r8169]
Dec 17 19:20:55 flux kernel: [  234.204795]  [<f984007b>] ? pktgen_change_name+0x7b/0xa0 [pktgen]
Dec 17 19:20:55 flux kernel: [  234.204974]  [<f98432c4>] ? pktgen_thread_worker+0xf4/0x370 [pktgen]
Dec 17 19:20:55 flux kernel: [  234.205167]  [<c1066380>] ? add_wait_queue+0x50/0x50
Dec 17 19:20:55 flux kernel: [  234.205316]  [<c1066380>] ? add_wait_queue+0x50/0x50
Dec 17 19:20:55 flux kernel: [  234.205467]  [<f98431d0>] ? pktgen_xmit+0x270/0x270 [pktgen]
Dec 17 19:20:55 flux kernel: [  234.205634]  [<c1065b7d>] ? kthread+0x6d/0x80
Dec 17 19:20:55 flux kernel: [  234.205767]  [<c1065b10>] ? flush_kthread_worker+0x80/0x80
Dec 17 19:20:55 flux kernel: [  234.205931]  [<c153517e>] ? kernel_thread_helper+0x6/0x10
Dec 17 19:20:55 flux kernel: [  234.217422] ---[ end trace 7229c628b96ddd29 ]---
Dec 17 19:20:55 flux kernel: [  234.229519] r8169 0000:02:00.0: eth0: link up
Dec 17 19:20:56 flux kernel: [  234.604219] r8169 0000:02:00.0: eth0: link up
Dec 17 19:21:00 flux kernel: [  238.592202] r8169 0000:02:00.0: eth0: link up
Dec 17 19:21:03 flux kernel: [  241.848236] r8169 0000:02:00.0: eth0: link up
Dec 17 19:21:05 flux kernel: [  244.144217] r8169 0000:02:00.0: eth0: link up
Dec 17 19:21:07 flux kernel: [  245.540217] r8169 0000:02:00.0: eth0: link up
Dec 17 19:21:14 flux kernel: [  253.228245] r8169 0000:02:00.0: eth0: link up
Dec 17 19:21:14 flux kernel: [  253.260220] r8169 0000:02:00.0: eth0: link up
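
The repeated "link up" lines in a log like the one above can be flagged mechanically. The following is a minimal sketch that assumes the syslog line format shown in this report; the regular expression and function name are illustrative only. It treats the first message seen for a device as the baseline and reports every later "link up" that arrives while the device is already marked up.

```python
# Sketch: find "link up" messages logged with no intervening "link down",
# the pattern described in this report as a canary for the freeze.
import re

# Matches e.g. "[  232.052203] r8169 0000:02:00.0: eth0: link up"
LINK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+r8169 \S+ (?P<dev>\w+): link (?P<state>up|down)"
)


def redundant_link_ups(lines):
    """Return (timestamp, device) for each 'link up' seen while already up."""
    state = {}   # device name -> last reported link state
    hits = []
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue
        dev, new_state = m["dev"], m["state"]
        if new_state == "up" and state.get(dev) == "up":
            hits.append((float(m["ts"]), dev))
        state[dev] = new_state
    return hits


if __name__ == "__main__":
    sample = [
        "Dec 17 19:20:53 flux kernel: [  232.052203] r8169 0000:02:00.0: eth0: link up",
        "Dec 17 19:20:55 flux kernel: [  233.772238] r8169 0000:02:00.0: eth0: link up",
        "Dec 17 19:20:55 flux kernel: [  233.788188] r8169 0000:02:00.0: eth0: link up",
    ]
    for ts, dev in redundant_link_ups(sample):
        print(f"{dev}: redundant link up at {ts}")
```

Feeding it /var/log/kern.log line by line would list the timestamps of the redundant messages, which makes it easier to correlate them with the net_rx_action warning.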

Please let me know if I'm reporting this to the right place, or if there's any other helpful info I can provide.

thanks,

Matt

