[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <4B379378.4000102@majjas.com>
Date: Sun, 27 Dec 2009 12:03:52 -0500
From: Michael Breuer <mbreuer@...jas.com>
To: Stephen Hemminger <shemminger@...ux-foundation.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
"Berck E. Nash" <flyboy@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
netdev@...r.kernel.org
Subject: Re: sky2 panic in 2.6.32.1 under load (new oops)
On 12/26/2009 3:37 PM, Michael Breuer wrote:
> On 12/26/2009 12:57 PM, Stephen Hemminger wrote:
>> On Fri, 25 Dec 2009 22:23:51 -0500
>> Michael Breuer<mbreuer@...jas.com> wrote:
>>
>>> On 12/25/2009 6:22 PM, Stephen Hemminger wrote:
>>>> On Fri, 25 Dec 2009 11:28:55 -0500
>>>> Michael Breuer<mbreuer@...jas.com> wrote:
>>>>
>>>>
>>>>> More data points - I'm able to reliably recreate this now.
>>>>> While I thought it was coincidence, each and every time I hit this
>>>>> issue
>>>>> there is a DHCP renew event immediately before the first error.
>>>>> The crash occurs while under load - in my case seems that the
>>>>> traffic is
>>>>> actually IPV6 (hadn't noticed that before).
>>>>> I ran nethogs on a remote display - the reported rx rate on the
>>>>> IPV6 smb
>>>>> connection at the time of the lockup was 33889.688 KB/sec on a 1gbit
>>>>> nic. I've got two events like this - don't recall if the earlier
>>>>> one was
>>>>> the exact same # - but it was in the ballpark.
>>>>>
>>>>> On 12/24/2009 2:01 AM, Andrew Morton wrote:
>>>>>
>>>>>> cc's added again.
>>>>>>
>>>>>> On Wed, 23 Dec 2009 17:54:27 -0500 Michael
>>>>>> Breuer<mbreuer@...jas.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Ok - not the firmware. Ran another Windows backup and sky2 went
>>>>>>> down.
>>>>>>>
>>>>>>> Nothing in dmesg.old - have oops in syslog. System became
>>>>>>> unresponsive
>>>>>>> and watchdog kicked in after a minute.
>>>>>>>
>>>>>>> Also note that I have a similar oops with VT-D disabled (posted
>>>>>>> here on
>>>>>>> 12/5). I'm attaching the oops from that below this oops for
>>>>>>> comparison.
>>>>>>> That also happened under similar load.
>>>>>>>
>>>>>>> On the assumption that I can recreate this (although it takes a
>>>>>>> while)
>>>>>>> please let me know how I can help.
>>>>>>>
>>>>>>> What's in my log (starting with an smbd error about 2 min before
>>>>>>> the
>>>>>>> oops (note: the dchpd is not the system doing the backup).
>>>>>>>
>>>>>>>
>>>>>> This (nastily wordwrapped) oops appers to be quite different from
>>>>>> Berck's one.
>>>>>>
>>>>>>
>>>>>>
>>>> What is the MTU?
>>>>
>>> 1500
>>>>>
>> It looks like the problem only shows up for packets generated by DHCP,
>> and these come through AF_PACKET. The problem maybe related to how this
>> packets are fragmented into header and page, in a different way than
>> other
>> packets confusing the driver or DMA engine.
>>
>> Does this help?
>> -----
>>
>> --- a/drivers/net/sky2.c 2009-12-26 09:50:20.869565022 -0800
>> +++ b/drivers/net/sky2.c 2009-12-26 09:55:54.620645355 -0800
>> @@ -1616,6 +1616,13 @@ static netdev_tx_t sky2_xmit_frame(struc
>> if (unlikely(tx_avail(sky2)< tx_le_req(skb)))
>> return NETDEV_TX_BUSY;
>>
>> + if (!pskb_may_pull(skb, ETH_HLEN)) {
>> + if (net_ratelimit())
>> + pr_info(PFX "%s: packet missing ether header (%d)?",
>> + dev->name, skb->len);
>> + goto drop;
>> + }
>> +
>> len = skb_headlen(skb);
>> mapping = pci_map_single(hw->pdev, skb->data, len,
>> PCI_DMA_TODEVICE);
>>
>> @@ -1761,6 +1768,7 @@ mapping_unwind:
>> mapping_error:
>> if (net_ratelimit())
>> dev_warn(&hw->pdev->dev, "%s: tx mapping error\n", dev->name);
>> +drop:
>> dev_kfree_skb(skb);
>> return NETDEV_TX_OK;
>> }
>>
>>
>>
>>
> That seems to have done the trick!
>
> Still one odd message sequence, but no hangs or crashes.
>
> The first time I forced a DHCP renew while running at high throughput,
> I got the same SMB errors I saw in my original error log (pre-crash).
> This only happened once:
> Dec 26 15:24:18 mail dhcpd: DHCPACK on 10.0.0.56 to 00:1c:cc:f3:9f:f6
> (BLACKBERRY-9542) via eth0
> Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0]
> lib/util_sock.c:1564(matchname)
> Dec 26 15:24:25 mail smbd[8937]: matchname: host name/address
> mismatch: ::ffff:10.0.0.11 != potter.majjas.com
> Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0]
> lib/util_sock.c:1685(get_peer_name)
> Dec 26 15:24:25 mail smbd[8937]: Matchname failed on
> potter.majjas.com ::ffff:10.0.0.11
> Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0]
> smbd/nttrans.c:2076(call_nt_transact_ioctl)
> Dec 26 15:24:25 mail smbd[8937]: call_nt_transact_ioctl(0x900eb):
> Currently not implemented.
>
> I would discount this, but the same sequence was present in the logs
> pre-crash as well. I do not see this at all absent the preceding DHCP
> renew sequence. I also don't see this unless the adapter is under load.
>
>
>
Just an FYI - I left stuff running overnight and saw lots of this this
AM... ans one from last night:
Dec 26 21:39:54 mail kernel: sky2 0000:06:00.0: error interrupt
status=0x40000008
Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt
status=0x40000008
Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt
status=0x40000008
Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt
status=0x40000008
Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt
status=0x40000008
Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8
Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt
status=0x40000008
Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt
status=0x40000008
Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt
status=0x40000008
Dec 27 05:09:42 mail kernel: sky2 0000:06:00.0: error interrupt
status=0x40000008
Dec 27 05:09:44 mail kernel: __ratelimit: 7 callbacks suppressed
... repeats for a couple of minutes. Never saw this before, but then
again, the system didn't stay up before.
I've also got this in dmesg (command, not /var/log/dmesg):
sky2 eth0: rx error, status 0x10660010 length 1518
sky2 eth0: rx error, status 0x10c60010 length 1518
sky2 eth0: rx error, status 0x14860010 length 1518
sky2 eth0: rx error, status 0x19e60010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x10660010 length 1518
sky2 eth0: rx error, status 0x10a60010 length 1518
sky2 eth0: rx error, status 0x10a60010 length 1518
sky2 eth0: rx error, status 0x11060010 length 1518
sky2 eth0: rx error, status 0x11060010 length 1518
DMA-API: debugging out of memory - disabling
... then later ...
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
DRHD: handling fault status reg 2
DMAR:[DMA Write] Request device [06:00.0] fault addr fff82da28000
DMAR:[fault reason 05] PTE Write access is not set
DRHD: handling fault status reg 102
DMAR:[DMA Write] Request device [06:00.0] fault addr fff82da03000
DMAR:[fault reason 05] PTE Write access is not set
DRHD: handling fault status reg 202
DMAR:[DMA Write] Request device [06:00.0] fault addr fff82d9ec000
DMAR:[fault reason 05] PTE Write access is not set
DRHD: handling fault status reg 302
DMAR:[DMA Write] Request device [06:00.0] fault addr fff82d9a0000
DMAR:[fault reason 05] PTE Write access is not set
DRHD: handling fault status reg 402
DMAR:[DMA Write] Request device [06:00.0] fault addr fff82d968000
DMAR:[fault reason 05] PTE Write access is not set
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf3/0x164()
Hardware name: System Product Name
NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
Modules linked in: ip6table_mangle ip6table_filter ip6_tables
iptable_raw iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp
appletalk psnap llc fbcon tileblit font bitblit softcursor nfsd lockd
nfs_acl auth_rpcgss exportfs sunrpc hwmon_vid coretemp acpi_cpufreq sit
tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp
nf_conntrack_ipv6 xt_multiport xt_DSCP xt_dscp xt_MARK ipv6 dm_multipath
kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi
gspca_spca505 snd_hda_intel snd_ac97_codec gspca_main snd_hda_codec
videodev v4l1_compat snd_hwdep ac97_bus snd_seq v4l2_compat_ioctl32
snd_seq_device snd_pcm firewire_ohci pcspkr firewire_core snd_timer snd
soundcore snd_page_alloc wmi sky2 asus_atk0110 i2c_i801 crc_itu_t hwmon
iTCO_wdt iTCO_vendor_support raid456 async_raid6_recov async_pq raid6_pq
async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi
pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit
cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: ip6_tables]
Pid: 25, comm: ksoftirqd/7 Tainted: G W
2.6.32-00830-gca45ac4-dirty #3
Call Trace:
<IRQ> [<ffffffff8105365a>] warn_slowpath_common+0x7c/0x94
[<ffffffff810536c9>] warn_slowpath_fmt+0x41/0x43
[<ffffffff813e2dd7>] ? netif_tx_lock+0x44/0x6c
[<ffffffff813e2f3f>] dev_watchdog+0xf3/0x164
[<ffffffff8146b76a>] ? _spin_unlock_irqrestore+0x29/0x41
[<ffffffff8103c62e>] ? check_preempt_curr_idle+0x15/0x17
[<ffffffff8104357d>] ? task_rq_unlock+0x11/0x13
[<ffffffff8104e050>] ? try_to_wake_up+0x2a2/0x2b4
[<ffffffff8106316b>] run_timer_softirq+0x1c8/0x270
[<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd
[<ffffffff81012e1c>] call_softirq+0x1c/0x30
<EOI> [<ffffffff810143a3>] do_softirq+0x4b/0xa6
[<ffffffff8105a730>] ksoftirqd+0x65/0xdc
[<ffffffff8105a6cb>] ? ksoftirqd+0x0/0xdc
[<ffffffff8107275c>] kthread+0x7f/0x87
[<ffffffff81012d1a>] child_rip+0xa/0x20
[<ffffffff8101267d>] ? restore_args+0x0/0x30
[<ffffffff810726dd>] ? kthread+0x0/0x87
[<ffffffff81012d10>] ? child_rip+0x0/0x20
---[ end trace 57f7151f6a5def07 ]---
sky2 eth0: tx timeout
sky2 eth0: transmit ring 27 .. 114 report=27 done=27
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
iint_free: readcount: 1
iint_free: opencount: 1
iint_free: readcount: 1
iint_free: opencount: 1
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
... this repeats for a while
I'm guessing that the dmesg and kernel messages correspond. System
stayed up and everything seems ok now. There were connections open at
the time of these events and they appear to have remained up.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists