lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87lklam937.fsf@sycorax.lbl.gov>
Date:	Thu, 14 Dec 2006 14:00:28 -0800
From:	Alex Romosan <romosan@...orax.lbl.gov>
To:	Stephen Hemminger <shemminger@...l.org>
Cc:	netdev@...r.kernel.org
Subject: Re: 2.6.20-rc1 sky2 problems (regression?)

Stephen Hemminger <shemminger@...l.org> writes:

> On Thu, 14 Dec 2006 12:47:05 -0800
> Alex Romosan <romosan@...orax.lbl.gov> wrote:
>
>> under heavy network load the sky2 driver (compiled in the kernel)
>> locks up and the only way i can get the network back is to reboot the
>> machine (bringing the network down and back up again doesn't help).
>> this happens on an amd64 machine (athlon 3500+ processor) and the card
>> in question is a Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit
>> Ethernet Controller (rev 15) (from lspci). this is what i see in the
>> syslog:
>> 
>> kernel: sky2 eth0: rx error, status 0x414a414a length 0
>> kernel: eth0: hw csum failure.
>> kernel: 
>> kernel: Call Trace:
>> kernel:  <IRQ>  [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66
>> kernel:  [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea
>> kernel:  [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20
>> kernel:  [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4
>> kernel:  [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b
>> kernel:  [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab
>> kernel:  [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e
>> kernel:  [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c
>> kernel:  [<ffffffff802219ce>] scheduler_tick+0x23/0x2f9
>> kernel:  [<ffffffff8044a796>] net_rx_action+0x61/0xf0
>> kernel:  [<ffffffff8022a35f>] __do_softirq+0x40/0x8a
>> kernel:  [<ffffffff8020a3cc>] call_softirq+0x1c/0x28
>> kernel:  [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d
>> kernel:  [<ffffffff8022a313>] irq_exit+0x36/0x42
>> kernel:  [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e
>> kernel:  [<ffffffff80208710>] default_idle+0x0/0x3a
>> kernel:  [<ffffffff80209bf1>] ret_from_intr+0x0/0xa
>> kernel:  <EOI>  [<ffffffff80208736>] default_idle+0x26/0x3a
>> kernel:  [<ffffffff8020878c>] cpu_idle+0x42/0x75
>> kernel:  [<ffffffff805df675>] start_kernel+0x1ce/0x1d3
>> kernel:  [<ffffffff805df140>] _sinittext+0x140/0x144
>> kernel: 
>> kernel: eth0: hw csum failure.
>> kernel: 
>> kernel: Call Trace:
>> kernel:  <IRQ>  [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66
>> kernel:  [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea
>> kernel:  [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20
>> kernel:  [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4
>> kernel:  [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b
>> kernel:  [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab
>> kernel:  [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e
>> kernel:  [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c
>> kernel:  [<ffffffff80474647>] tcp_delack_timer+0x0/0x1b5
>> kernel:  [<ffffffff8044a796>] net_rx_action+0x61/0xf0
>> kernel:  [<ffffffff8022a35f>] __do_softirq+0x40/0x8a
>> kernel:  [<ffffffff8020a3cc>] call_softirq+0x1c/0x28
>> kernel:  [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d
>> kernel:  [<ffffffff8022a313>] irq_exit+0x36/0x42
>> kernel:  [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e
>> kernel:  [<ffffffff80209bf1>] ret_from_intr+0x0/0xa
>> kernel:  <EOI>  [<ffffffff802a8402>] inode2sd+0x104/0x117
>> kernel:  [<ffffffff802b8cfa>] search_by_key+0xa08/0xbfe
>> kernel:  [<ffffffff802b8475>] search_by_key+0x183/0xbfe
>> kernel:  [<ffffffff80284778>] ll_rw_block+0x89/0x9e
>> kernel:  [<ffffffff802b8475>] search_by_key+0x183/0xbfe
>> kernel:  [<ffffffff80283cf5>] __find_get_block_slow+0x101/0x10d
>> kernel:  [<ffffffff80284053>] __find_get_block+0x197/0x1a5
>> kernel:  [<ffffffff8026800c>] inode_get_bytes+0x2a/0x52
>> kernel:  [<ffffffff802a89f1>] reiserfs_update_sd_size+0x7e/0x284
>> kernel:  [<ffffffff80237700>] kthread+0xed/0xfd
>> kernel:  [<ffffffff802be990>] do_journal_end+0x34b/0xbdd
>> kernel:  [<ffffffff802b1729>] reiserfs_dirty_inode+0x56/0x76
>> kernel:  [<ffffffff80284c19>] block_prepare_write+0x1a/0x24
>> kernel:  [<ffffffff802809b1>] __mark_inode_dirty+0x29/0x197
>> kernel:  [<ffffffff802a8d04>] reiserfs_commit_write+0x10d/0x19f
>> kernel:  [<ffffffff80284c19>] block_prepare_write+0x1a/0x24
>> kernel:  [<ffffffff802484fc>] generic_file_buffered_write+0x4ad/0x6c4
>> kernel:  [<ffffffff80271b3c>] __pollwait+0x0/0xe0
>> kernel:  [<ffffffff8022a006>] current_fs_time+0x35/0x3b
>> kernel:  [<ffffffff80248a8c>] __generic_file_aio_write_nolock+0x379/0x3ec
>> kernel:  [<ffffffff8049baca>] unix_dgram_recvmsg+0x1be/0x1d9
>> kernel:  [<ffffffff804b6516>] __mutex_lock_slowpath+0x205/0x210
>> kernel:  [<ffffffff80248b60>] generic_file_aio_write+0x61/0xc1
>> kernel:  [<ffffffff80248aff>] generic_file_aio_write+0x0/0xc1
>> kernel:  [<ffffffff80264e57>] do_sync_readv_writev+0xc0/0x107
>> kernel:  [<ffffffff802377f7>] autoremove_wake_function+0x0/0x2e
>> kernel:  [<ffffffff80229d16>] getnstimeofday+0x10/0x28
>> kernel:  [<ffffffff80264ced>] rw_copy_check_uvector+0x6c/0xdc
>> kernel:  [<ffffffff802654f7>] do_readv_writev+0xb2/0x18b
>> kernel:  [<ffffffff80265a2c>] sys_writev+0x45/0x93
>> kernel:  [<ffffffff802096de>] system_call+0x7e/0x83
>> 
>> and so on. some times i don't get this trace but instead i get:
>> 
>> kernel: sky2 eth0: tx timeout
>> kernel: sky2 eth0: transmit ring 140 .. 99 report=181 done=181
>> kernel: sky2 status report lost?
>> kernel: NETDEV WATCHDOG: eth0: transmit timed out
>> kernel: sky2 eth0: tx timeout
>> kernel: sky2 eth0: transmit ring 181 .. 140 report=181 done=181
>> kernel: sky2 hardware hung? flushing
>> 
> Pleas report these problems to netdev@...r.kernel.org, I rarely go
> looking in LKML.
>
> These are the things you need to debug a sky2 related problem.
>
> 1) What is exact kernel version in use?  This is important because
>    problems get fixed but it can be a long while until the fix bubbles down
>    to the vendor kernels.

this is stock kernel.org kernel version 2.6.20-rc1 i downloaded this
morning. 2.6.19 and 2.6.19-rc6 i referred to in my original message
were also donloaded from kernel.org.

> 2) What is the chip version?  The driver prints this out on boot up in
>    the console log.   (dmesg | grep sky2)
>    This matters because each chip version has different
>    bugs to deal with.

sky2 v1.10 addr 0xfddfc000 irq 17 Yukon-EC (0xb6) rev 1
sky2 eth0: addr 00:11:09:da:39:a3
sky2 eth0: enabling interface
sky2 eth0: ram buffer 48K
sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both


> 3) Does it work with the vendor driver?
>    The vendor driver does a number of things differently than the sky2 driver
>    and can mask problems, but if it doesn't work as well that is a useful
>    data point.  If you want to know why the sky2 driver was written instead
>    of just using the vendor driver, look at the code. The sk98lin driver
>    is huge, includes features that are unsupportable and broken, and locking
>    mistakes.  But the sk98lin also has a watchdog that masks off bugs and
>    may provide useful insight.

i haven't tried the vendor driver yet, but i guess i will, and let you
know what happens.

> 4) What is the IRQ routing?
>    There are two issues here, first the driver will never work with edge
>    trigger IRQ's, some motherboards also have busted BIOS and chipsets
>    that don't do MSI properly. A couple of module parameters are available
>    to help:
>       disable_msi=1   		avoids using MSI
>       idle_timeout=10		polls for lost IRQ's every N ms (10)

hmm, i have MSI interrupts enabled in the config and cat
/proc/interrups gives me:

283:    1474208   PCI-MSI-edge      eth0

so you say i should dissable msi?

> 5) What are the messages in the console log when problem happens?

see my original message i kept above.

> 6) Are you running any of the following: bonding, vlans, bridging,
>    netfilter, traffic control?

no.

> 7) Please get a current version of ethtool from:
>    git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
>    and run ethtool register dump after a problem occurs:
>       ethtool -d eth0

i've downloaded it and i'll run it next time the machine locks up.

> 8) Are you using a dual port board.  There were issues on the PCI-X
>    version that required hacks, the PCI-express version may have the
>    same problem.  Basically, checksum offload wouldn't work and receive
>    DMA's would arrive out of order.

it is a dual port board but i am using only one port.

--alex--

-- 
| I believe the moment is at hand when, by a paranoiac and active |
|  advance of the mind, it will be possible (simultaneously with  |
|  automatism and other passive states) to systematize confusion  |
|  and thus to help to discredit completely the world of reality. |
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ