lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 14 Nov 2008 22:01:46 -0600
From:	Roger Heflin <rogerheflin@...il.com>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	LKML <linux-kernel@...r.kernel.org>,
	netdev <netdev@...r.kernel.org>
Subject: Re: WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0xfe/0x17e()
 with tg3 network

Peter Zijlstra wrote:
> (netdev CC'ed)
> 
> On Tue, 2008-11-11 at 03:48 -0600, Roger Heflin wrote:
>> I have duplicate this with kernel 2.6.27.2 and 2.6.27.5, no
>> extra modules, tg3 Gbit networking.   I have not yet tested
>> earlier kernels to see if this has been around for a while.
> 
> How do more recent kernels do?

I did not try more recent kernels, more testing seems to indicate
that at the very least it the bug depends on a certain version of
either tg3 and/or firmware to happen, as my second tg3 port does not
have it happen.    More about this below.

> 
>> So far I have had this error happen 5 times (MTBF is maybe
>> 12 hours), 4 of the 5 times resulted in the networking being
>> broken, one time things came back by itself without a reboot,
>> I believe in this case the hang was traffic coming into the
>> machine vs the other times going out of the machine.
>>
>> Unloading all of the network modules and reloading them did
>> not correct the problem.
>>
>> Searching google finds a couple of other people getting the
>> same error but they have a different network chipset (e1000
>> and a rt811C chipset), which makes me thing that there is
>> something interacting bad with the network.    Or does this
>> error truly mean that the network chipset for some unknown reason
>> locked itself up?
>>
>> http://www.google.com/url?sa=U&start=4&q=http://kerneltrap.org/mailarchive/linux-netdev/2008/8/6/2838184&ei=rU8ZScysAon8edz5xKgO&sig2=Wxp7IkUtdgORGZiflxvppg&usg=AFQjCNHzPwsCOmLGKmtX4q_FEpk6oubxxg
>> http://article.gmane.org/gmane.linux.network/110238
>>
>> The changes I made recently were to upgrade my MB (old
>> was E100 on a 100Mbit network,new is tg3 on a Gbit network,
>> cpu and memory are the same, MB chipset is a intel 955
>> chipset vs the old being a intel 915 chipset).
>>
>> Autoneg is turned on all around, the GBit switch is a
>> 8-port Dlink switch.   The network seems to otherwise be working
>> correctly.
>>
>> I did test the network under decent load and the error did not
>> appear to be any more likely under load, and typically the network
>> is under very light load 2-3MB/second.
>>
>> The machine originally had 2 HT CPU's showing up, I turned off HT
>> so that only one cpu was showing, but this did not change the error.
>>
>> I am first turning off all offload capabilities on tg3 and going
>> to see if that changes anything.

This made no difference in the error.

>>
>> The next thing I am going to be doing is to turn of GB capability
>> on the networking and see if that does anything.

Did not try.

>>
>> I also have a second tg3 port that is slightly different, so I may
>> try that eventually.

I tried this, and with the second port I don't appear to be getting
the error.   The first port is a 5789-v3.29a and the second port is a
5788-v3.04, I know the first port is faster (pcie-x1) than the second
port (pci bus-built-in, unknown exact connection).   The second port
will sustain about 50MB/second, were as the first port will get
 >90MB/second.

It seems to me to likely be the firmware on the tg3, and it would seem
unlikely that the driver could do anything more than work around the
issue that is in the firmware, and currently my system works on the
second port, and the second port is fast enough for my needs.

If someone else runs into this issue, since I have 2 ports I would be
able to do some testing on it, right now my first port is locked up, and
the machine is running fine on the second port.

lspci -vvv for the first (bad) port:

02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5789 Gigabit 
Ethernet PCI Express (rev 11)
         Subsystem: Foxconn International, Inc. Unknown device 0cc1
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B-
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- 
<MAbort- >SERR- <PERR-
         Latency: 0, Cache Line Size: 32 bytes
         Interrupt: pin A routed to IRQ 19
         Region 0: Memory at fd8f0000 (64-bit, non-prefetchable) [size=64K]
         Expansion ROM at <ignored> [disabled]
         Capabilities: [48] Power Management version 2
                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot+,D3cold+)
                 Status: D3 PME-Enable- DSel=0 DScale=1 PME-
         Capabilities: [50] Vital Product Data
         Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 
Enable-
                 Address: 0101b8102a0f7b0c  Data: f21e
         Capabilities: [d0] Express Endpoint IRQ 0
                 Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag+
                 Device: Latency L0s <4us, L1 unlimited
                 Device: AtnBtn- AtnInd- PwrInd-
                 Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
                 Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                 Device: MaxPayload 128 bytes, MaxReadReq 4096 bytes
                 Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
                 Link: Latency L0s <2us, L1 <64us
                 Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
                 Link: Speed 2.5Gb/s, Width x1
         Capabilities: [100] Advanced Error Reporting
         Capabilities: [13c] Virtual Channel


>>
>> Nov 11 00:44:39 computer kernel: ------------[ cut here ]------------
>> Nov 11 00:44:39 computer kernel: WARNING: at net/sched/sch_generic.c:219 
>> dev_watchdog+0xfe/0x17e()
>> Nov 11 00:44:39 computer kernel: NETDEV WATCHDOG: eth0 (tg3): transmit timed out
>> Nov 11 00:44:39 computer kernel: Modules linked in: nfsd auth_rpcgss exportfs 
>> w83627ehf hwmon_vid hwmon nfs lockd nfs_acl sunrpc ipv6 xfs raid456 async_xor 
>> async_memcpy async_tx xor video output sbs sbshc battery ac lgdt330x cx88_dvb 
>> wm8775 cx88_vp3054_i2c cx25840 tuner_simple tuner_types tda9887 tda8290 tuner 
>> mt2131 s5h1409 snd_hda_intel snd_seq_dummy ivtv cx8800 snd_seq_oss cx88_alsa 
>> cx8802 cx88xx cx23885 snd_seq_midi_event snd_seq ir_common videodev v4l1_compat 
>> i2c_algo_bit cx2341x firewire_ohci iTCO_wdt snd_seq_device compat_ioctl32 
>> videobuf_dvb i2c_i801 firewire_core tveeprom floppy iTCO_vendor_support 
>> v4l2_common snd_pcm_oss dvb_core pcspkr tg3 sata_sil i2c_core btcx_risc 
>> videobuf_dma_sg crc_itu_t snd_mixer_oss libphy videobuf_core snd_pcm parport_pc 
>> parport snd_timer snd soundcore button snd_page_alloc sg dm_snapshot dm_zero 
>> dm_mirror dm_log dm_mod ahci ata_piix ata_generic libata sd_mod scsi_mod ext3 
>> jbd mbcache ehci_hcd ohci_hcd uhci_hcd [last unloaded: eeprom]
>> Nov 11 00:44:39 computer kernel: Pid: 0, comm: swapper Not tainted 2.6.27.5 #2
>> Nov 11 00:44:39 computer kernel:  [<c042524f>] warn_slowpath+0x61/0x83
>> Nov 11 00:44:39 computer kernel:  [<c05663a4>] usb_hcd_submit_urb+0x75c/0x811
>> Nov 11 00:44:39 computer kernel:  [<c0594972>] hiddev_hid_event+0x0/0x64
>> Nov 11 00:44:39 computer kernel:  [<c058ce80>] hid_process_event+0x58/0x5f
>> Nov 11 00:44:39 computer kernel:  [<c04e13d6>] __next_cpu+0x12/0x21
>> Nov 11 00:44:39 computer kernel:  [<c041cbe3>] find_busiest_group+0x23e/0x672
>> Nov 11 00:44:39 computer kernel:  [<c0439d1e>] clocksource_get_next+0x39/0x3f
>> Nov 11 00:44:39 computer kernel:  [<c0438e51>] update_wall_time+0x567/0x70c
>> Nov 11 00:44:39 computer kernel:  [<c040783e>] read_tsc+0x6/0x22
>> Nov 11 00:44:39 computer kernel:  [<c04387e8>] getnstimeofday+0x37/0xc1
>> Nov 11 00:44:39 computer kernel:  [<f8829a83>] uhci_scan_schedule+0x11b/0x6b0 
>> [uhci_hcd]
>> Nov 11 00:44:39 computer kernel:  [<c05b16ba>] dev_watchdog+0xfe/0x17e
>> Nov 11 00:44:39 computer kernel:  [<c042c66f>] __mod_timer+0x99/0xa3
>> Nov 11 00:44:39 computer kernel:  [<c05654b6>] rh_timer_func+0x0/0x5
>> Nov 11 00:44:39 computer kernel:  [<c05654ae>] usb_hcd_poll_rh_status+0x12b/0x133
>> Nov 11 00:44:39 computer kernel:  [<c043bca8>] tick_dev_program_event+0x1e/0x81
>> Nov 11 00:44:39 computer kernel:  [<c05b15bc>] dev_watchdog+0x0/0x17e
>> Nov 11 00:44:39 computer kernel:  [<c042c2b4>] run_timer_softirq+0x10e/0x167
>> Nov 11 00:44:39 computer kernel:  [<c05b15bc>] dev_watchdog+0x0/0x17e
>> Nov 11 00:44:39 computer kernel:  [<c0428d3e>] __do_softirq+0x5d/0xc1
>> Nov 11 00:44:39 computer kernel:  [<c0428dd4>] do_softirq+0x32/0x36
>> Nov 11 00:44:39 computer kernel:  [<c0412939>] smp_apic_timer_interrupt+0x6e/0x79
>> Nov 11 00:44:39 computer kernel:  [<c040431c>] apic_timer_interrupt+0x28/0x30
>> Nov 11 00:44:39 computer kernel:  [<c0408582>] mwait_idle+0x32/0x38
>> Nov 11 00:44:39 computer kernel:  [<c040255d>] cpu_idle+0xbd/0xd5
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists