linux-kernel - RE: 3.19: ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout

Message-ID: <87618083B2453E4A8714035B62D67992502411C3@FMSMSX105.amr.corp.intel.com>
Date:	Mon, 23 Feb 2015 16:42:52 +0000
From:	"Tantilov, Emil S" <emil.s.tantilov@...el.com>
To:	Justin Piszcz <jpiszcz@...idpixels.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: 3.19: ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout

>-----Original Message-----
>From: linux-kernel-owner@...r.kernel.org [mailto:linux-kernel-owner@...r.kernel.org] On Behalf Of Justin Piszcz
>Sent: Sunday, February 22, 2015 4:01 AM
>To: linux-kernel@...r.kernel.org
>Subject: 3.19: ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout
>
>Hello,
>
>Kernel: 3.19.0
>Issue: When using robocopy to copy files (from Windows 8/8.1) to
>Linux/samba, the 10GbE NIC resets - dmesg [1] below.  To get it back working
>again, I have to down/up the interface.  Jumbo frames are being used (mtu of
>9014) on each side. The lspci output is listed below.  Are there any other
>recommended workarounds for this issue as LRO is already off for me as shown
>below.  When using Linux<->Linux with rsync or NFS, there are no errors with
>10GbE.  When using Samba<->Windows 8 over 10GbE, this issue occurs
>persistently as shown below when a copy is running.
>
># ethtool -k eth4|grep large
>large-receive-offload: off [fixed]

The issue is a Tx timeout, and LRO is a receive-side offload, so it is unlikely to have an effect. Is the interface that hangs (eth4) mostly receiving or transmitting? Posting the stats (ethtool -S eth4) would help here.
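
As a rough way to answer the receive-vs-transmit question from the stats, here is a minimal Python sketch. It assumes the output of ethtool -S is a list of "name: value" lines and that the driver exposes rx_bytes and tx_bytes counters. The helper names and the sample numbers are made up for illustration and are not from the reporter's system.

```python
import subprocess


def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Skip the "NIC statistics:" header and anything non-numeric.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def traffic_direction(stats):
    """Compare byte counters; assumes rx_bytes/tx_bytes exist (0 if missing)."""
    tx = stats.get("tx_bytes", 0)
    rx = stats.get("rx_bytes", 0)
    if tx == rx:
        return "balanced"
    return "mostly transmitting" if tx > rx else "mostly receiving"


def read_stats(iface):
    """Run `ethtool -S <iface>` and parse it (needs ethtool installed)."""
    out = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    ).stdout
    return parse_ethtool_stats(out)


# Hypothetical sample output, for illustration only.
SAMPLE = """NIC statistics:
     rx_packets: 1000
     tx_packets: 250
     rx_bytes: 9000000
     tx_bytes: 20000
"""

if __name__ == "__main__":
    print(traffic_direction(parse_ethtool_stats(SAMPLE)))
```

On the affected machine, read_stats("eth4") would replace the sample text. Taking two snapshots while a copy is running and comparing the deltas is likely more telling than the totals since boot.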

>There is/was a similar issue as reported here:
>https://communities.intel.com/message/207408
>
> [1] dmesg
>
> [538576.098186] ixgbe 0000:01:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
> [541013.223961] ------------[ cut here ]------------
> [541013.223970] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x227/0x230()
> [541013.223971] NETDEV WATCHDOG: eth4 (ixgbe): transmit queue 0 timed out
> [541013.223972] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.19.0 #2
> [541013.223973] Hardware name: Supermicro X9SRL-F/X9SRL-F, BIOS 3.0a 12/05/2013
> [541013.223974]  ffffffff81d3a6ae ffff88107fc03da8 ffffffff819d07d7 ffffffff81e34d98
> [541013.223976]  ffff88107fc03df8 ffff88107fc03de8 ffffffff810dbdab 0000000000000000
> [541013.223977]  0000000000000000 ffff881036304000 0000000000000000 0000000000000010
> [541013.223979] Call Trace:
> [541013.223979]  <IRQ>  [<ffffffff819d07d7>] dump_stack+0x45/0x57
> [541013.223985]  [<ffffffff810dbdab>] warn_slowpath_common+0x7b/0xc0
> [541013.223987]  [<ffffffff810dbe61>] warn_slowpath_fmt+0x41/0x50
> [541013.223990]  [<ffffffff810eec4c>] ? __queue_work+0xfc/0x290
> [541013.223996]  [<ffffffff818ef0a7>] dev_watchdog+0x227/0x230
> [541013.223997]  [<ffffffff818eee80>] ? qdisc_rcu_free+0x40/0x40
> [541013.223998]  [<ffffffff818eee80>] ? qdisc_rcu_free+0x40/0x40
> [541013.224001]  [<ffffffff811251f7>] call_timer_fn.isra.29+0x17/0x80
> [541013.224002]  [<ffffffff81125429>] run_timer_softirq+0x1c9/0x280
> [541013.224004]  [<ffffffff810dec7f>] __do_softirq+0xff/0x200
> [541013.224005]  [<ffffffff810deea6>] irq_exit+0x76/0xa0
> [541013.224007]  [<ffffffff8106ac11>] smp_apic_timer_interrupt+0x41/0x50
> [541013.224009]  [<ffffffff819da6aa>] apic_timer_interrupt+0x6a/0x70
> [541013.224009]  <EOI>  [<ffffffff8184e8f8>] ? cpuidle_enter_state+0x48/0xc0
> [541013.224013]  [<ffffffff8184e8ed>] ? cpuidle_enter_state+0x3d/0xc0
> [541013.224014]  [<ffffffff8184ea42>] cpuidle_enter+0x12/0x20
> [541013.224017]  [<ffffffff8110f222>] cpu_startup_entry+0x272/0x2f0
> [541013.224018]  [<ffffffff819cdd5d>] rest_init+0x6d/0x70
> [541013.224021]  [<ffffffff81ef0dbb>] start_kernel+0x353/0x360
> [541013.224022]  [<ffffffff81ef0495>] x86_64_start_reservations+0x2a/0x2c
> [541013.224023]  [<ffffffff81ef055f>] x86_64_start_kernel+0xc8/0xcc
> [541013.224024] ---[ end trace 59877113cf8b7358 ]---
> [541013.224026] ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout
> [541013.224036] ixgbe 0000:01:00.0 eth4: Reset adapter
> [541020.099402] ixgbe 0000:01:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
>
> ( .. it continues, but without the trace later .. )
>
> [567457.771728] ixgbe 0000:01:00.0 eth4: NIC Link is Down
> [567458.140112] ixgbe 0000:01:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
> [567561.611941] ixgbe 0000:01:00.0 eth4: NIC Link is Down
> [567568.188422] ixgbe 0000:01:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
> [570130.483823] ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout
> [570130.483924] ixgbe 0000:01:00.0 eth4: Reset adapter

The reset is a side effect of the Tx hang - the driver is trying to recover from the hang by resetting the interface.
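
To see how often the driver goes through this recovery, the reset messages can be pulled out of a saved dmesg log. The following is a small Python sketch, assuming the message format shown in the quoted log. The function names are hypothetical, and the sample lines are copied from the dmesg output above.

```python
import re

# Matches e.g. "[541013.224026] ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout"
RESET_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+ixgbe\s+(\S+)\s+(\S+): initiating reset due to tx timeout"
)


def find_tx_timeout_resets(log_text):
    """Return (timestamp_seconds, interface) for each tx-timeout reset line."""
    events = []
    for line in log_text.splitlines():
        m = RESET_RE.search(line)  # search() tolerates a leading "> " quote prefix
        if m:
            events.append((float(m.group(1)), m.group(3)))
    return events


def gaps_between(events):
    """Seconds elapsed between consecutive reset events."""
    times = [t for t, _ in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]


LOG = """\
> [541013.224026] ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout
> [541013.224036] ixgbe 0000:01:00.0 eth4: Reset adapter
> [570130.483823] ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout
> [570130.483924] ixgbe 0000:01:00.0 eth4: Reset adapter
"""

if __name__ == "__main__":
    events = find_tx_timeout_resets(LOG)
    for gap in gaps_between(events):
        print(f"{gap:.0f} s between resets")
```

For the two resets quoted above, the gap comes out at roughly 29117 seconds, or about 8 hours. The timestamps are seconds since boot, so this only works within a single uptime.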

If you could open a ticket at e1000.sf.net with details about your setup and how you configure the interfaces, that would help us get a better idea of the issue. You can also upload the stats, kernel config and any other logs that may be relevant.

Thanks,
Emil

