Message-ID: <CAKgT0Ue0ym7Q=R2KWqbeDNr8D22toe1Lqq5yHeykjvutirx=+A@mail.gmail.com>
Date: Wed, 24 Jan 2018 08:11:58 -0800
From: Alexander Duyck <alexander.duyck@...il.com>
To: Ben Greear <greearb@...delatech.com>,
intel-wired-lan <intel-wired-lan@...ts.osuosl.org>,
e1000-devel@...ts.sourceforge.net
Cc: netdev <netdev@...r.kernel.org>,
"Neftin, Sasha" <sasha.neftin@...el.com>
Subject: Re: e1000e hardware unit hangs
On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <greearb@...delatech.com> wrote:
> Hello,
>
> Anyone have any more suggestions for making e1000e work better? This is
> from a 4.9.65+ kernel,
> with these additional e1000e patches applied:
>
> e1000e: Fix error path in link detection
> e1000e: Fix wrong comment related to link detection
> e1000e: Fix return value test
> e1000e: Separate signaling for link check/link up
> e1000e: Avoid receiver overrun interrupt bursts
Most of these patches are unlikely to affect anything that would trigger
Tx hangs; they mostly deal with link detection.
> Test case is simply to run 30000 tcp connections each trying to send 56Kbps
> of bi-directional
> data between a pair of e1000e interfaces :)
>
> No OOM related issues are seen on this kernel...similar test on 4.13 showed
> some OOM
> issues, but I have not debugged that yet...
A question like this probably belongs on e1000-devel or
intel-wired-lan, so I have added those lists and the e1000e maintainer
to the thread.
It would be useful if you could provide more information about the
device itself, such as the ID, and about the kind of test you are running.
Keep in mind that the e1000e driver supports a broad range of
devices, so we need to narrow things down a bit.
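One way to collect the device ID is to read it from sysfs. The sketch below is only an illustration: it assumes the standard Linux layout under /sys/bus/pci/devices, takes the PCI addresses 0000:06:00.0 and 0000:07:00.0 from the quoted log, and uses a hypothetical helper name, read_pci_ids.

```python
from pathlib import Path


def read_pci_ids(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return (vendor, device) ID strings for a PCI address like '0000:06:00.0'.

    Assumes the usual sysfs layout, where each device directory holds
    'vendor' and 'device' files containing hex strings such as '0x8086'.
    """
    dev_dir = Path(sysfs_root) / bdf
    vendor = (dev_dir / "vendor").read_text().strip()
    device = (dev_dir / "device").read_text().strip()
    return vendor, device


if __name__ == "__main__":
    # PCI addresses taken from the "e1000e 0000:06:00.0 eth2" log lines.
    for bdf in ("0000:06:00.0", "0000:07:00.0"):
        try:
            vendor, device = read_pci_ids(bdf)
            print(f"{bdf}: vendor={vendor} device={device}")
        except OSError as err:
            print(f"{bdf}: could not read IDs ({err})")
```

The vendor:device pair it prints is what identifies which of the supported parts is in use.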
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3
> (e1000e): transmit queue 0 timed out, trans_start: 4294737199, wd-timeout:
> 5000 jiffies: 4294745088 tx-queues: 1
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2
> (e1000e): transmit queue 0 timed out, trans_start: 4294737200, wd-timeout:
> 5000 jiffies: 4294745088 tx-queues: 1
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut here
> ]------------
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 7 PID: 0
> at /home/greearb/git/linux-4.9.dev.y/net/sched/sch_generic.c:322
> dev_watchdog+0x267/0x270
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in:
> nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 cfg80211 bnep
> bluetooth macvlan wanlink(O) pktgen fuse corete...sunrpc ipmi_d
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: CPU: 7 PID: 0 Comm:
> swapper/7 Tainted: G O 4.9.65+ #21
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Hardware name:
> Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ffff88042fdc3df0
> ffffffff8142d791 0000000000000000 0000000000000000
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ffff88042fdc3e30
> ffffffff8110f266 000001422fdc3e08 0000000000000000
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: 0000000000001388
> 00000000fffc7d30 ffff880417d0c000 00000000fffc9c00
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Call Trace:
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: <IRQ>
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8142d791>]
> dump_stack+0x63/0x82
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8110f266>]
> __warn+0xc6/0xe0
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8110f338>]
> warn_slowpath_null+0x18/0x20
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da497>]
> dev_watchdog+0x267/0x270
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da230>] ?
> qdisc_rcu_free+0x40/0x40
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8117bf70>]
> call_timer_fn+0x30/0x150
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da230>] ?
> qdisc_rcu_free+0x40/0x40
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8117c350>]
> run_timer_softirq+0x1f0/0x450
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81051021>] ?
> lapic_next_deadline+0x21/0x30
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8118a54d>] ?
> clockevents_program_event+0x7d/0x120
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81115101>]
> __do_softirq+0xc1/0x2c0
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81115461>]
> irq_exit+0xb1/0xc0
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81051c9d>]
> smp_apic_timer_interrupt+0x3d/0x50
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81895842>]
> apic_timer_interrupt+0x82/0x90
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: <EOI>
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81726e46>] ?
> cpuidle_enter_state+0x126/0x300
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81727042>]
> cpuidle_enter+0x12/0x20
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff811521ce>]
> call_cpuidle+0x1e/0x40
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8115240a>]
> cpu_startup_entry+0x13a/0x220
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8104fbd9>]
> start_secondary+0x149/0x170
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace
> 69e31de175b59d4f ]---
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0
> eth2: Reset adapter unexpectedly
> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0
> eth3: Reset adapter unexpectedly
> Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0
> eth2: Detected Hardware Unit Hang:
> TDH
> <a8>
> TDT
> <f3>...
> Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3
> (e1000e): transmit queue 0 timed out, trans_start: 4294748730, wd-timeout:
> 5000 jiffies: 4294759424 tx-queues: 1
> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2
> (e1000e): transmit queue 0 timed out, trans_start: 4294748730, wd-timeout:
> 5000 jiffies: 4294759424 tx-queues: 1
> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0
> eth3: Reset adapter unexpectedly
> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0
> eth2: Reset adapter unexpectedly
> Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2
> (e1000e): transmit queue 0 timed out, trans_start: 4294766123, wd-timeout:
> 5000 jiffies: 4294771200 tx-queues: 1
> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3
> (e1000e): transmit queue 0 timed out, trans_start: 4294766125, wd-timeout:
> 5000 jiffies: 4294771200 tx-queues: 1
> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0
> eth2: Reset adapter unexpectedly
> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0
> eth3: Reset adapter unexpectedly
> Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0
> eth2: Detected Hardware Unit Hang:
> TDH
> <c8>
> TDT
> <f5>...
> Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>
>
> Thanks,
> Ben
>
> --
> Ben Greear <greearb@...delatech.com>
> Candela Technologies Inc http://www.candelatech.com
>
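For reading the quoted log, here is a rough sketch of how the numbers could be pulled out. It rests on assumptions that are not confirmed in this thread: that each watchdog message has been unwrapped onto one line, that the TDH/TDT values are printed in hex, and that the Tx ring holds 256 descriptors. The function names are hypothetical, and jiffies wraparound is not handled.

```python
import re

# Matches the unwrapped "NETDEV WATCHDOG" lines as they appear in the log.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<dev>\S+) \((?P<drv>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out, "
    r"trans_start: (?P<trans_start>\d+), "
    r"wd-timeout: (?P<timeout>\d+) jiffies: (?P<jiffies>\d+)"
)


def parse_watchdog(line):
    """Return a dict of fields plus 'stalled' (jiffies since trans_start).

    Returns None if the line is not a watchdog message in this format.
    """
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    info = {
        "dev": m["dev"],
        "drv": m["drv"],
        "queue": int(m["queue"]),
        "trans_start": int(m["trans_start"]),
        "timeout": int(m["timeout"]),
        "jiffies": int(m["jiffies"]),
    }
    # Plain subtraction; assumes the counter did not wrap in between.
    info["stalled"] = info["jiffies"] - info["trans_start"]
    return info


def pending_descriptors(tdh, tdt, ring_size=256):
    """Descriptors between head (TDH) and tail (TDT), modulo the ring size.

    ring_size=256 is an assumption; check the actual ring size on the system.
    """
    return (tdt - tdh) % ring_size


if __name__ == "__main__":
    line = (
        "NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, "
        "trans_start: 4294737199, wd-timeout: 5000 jiffies: 4294745088 "
        "tx-queues: 1"
    )
    info = parse_watchdog(line)
    print(info["dev"], "stalled for", info["stalled"], "jiffies")
    # TDH <a8> / TDT <f3> from the first "Detected Hardware Unit Hang" report.
    print("pending descriptors:", pending_descriptors(0xA8, 0xF3))
```

Under those assumptions, the first eth3 timeout comes out at 7889 jiffies since the last transmit start against a 5000 jiffy timeout, and the first eth2 hang report shows 75 descriptors between head and tail.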