Message-ID: <d33f0ab4-4dc4-49cd-bbd0-055f58dd6758@hetzner-cloud.de>
Date: Wed, 9 Apr 2025 17:17:49 +0200
From: Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de>
To: Tony Nguyen <anthony.l.nguyen@...el.com>, Jay Vosburgh
<jv@...sburgh.net>, Przemek Kitszel <przemyslaw.kitszel@...el.com>,
Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
Jesper Dangaard Brouer <hawk@...nel.org>,
John Fastabend <john.fastabend@...il.com>
Cc: intel-wired-lan@...ts.osuosl.org, netdev@...r.kernel.org,
bpf@...r.kernel.org, linux-kernel@...r.kernel.org, sdn@...zner-cloud.de
Subject: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
Hi,
In a setup where I use native XDP to redirect packets to a bonding interface
that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
resets the NIC with the following kernel output:
ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
  Tx Queue             <4>
  TDH, TDT             <17e>, <17e>
  next_to_use          <181>
  next_to_clean        <17e>
tx_buffer_info[next_to_clean]
  time_stamp           <0>
  jiffies              <10025c380>
ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
This only occurs in combination with a bonding interface and XDP, so I don't
know if this is an issue with ixgbe or the bonding driver.
I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
show the same issue.
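
To make clear what I mean by redirecting to the bonding interface, here is a
minimal sketch of such an XDP program. It is only an illustration, not the
exact program from our production setup, and the ifindex is a placeholder:

/* Minimal sketch: redirect every received frame to the bonding interface.
 * BOND_IFINDEX is a placeholder; in reality the ifindex of bond0 is
 * resolved at load time.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define BOND_IFINDEX 4 /* placeholder: ifindex of bond0 */

SEC("xdp")
int xdp_redirect_to_bond(struct xdp_md *ctx)
{
        /* Send every received frame out through the bonding interface. */
        return bpf_redirect(BOND_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";
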
I managed to reproduce this bug in a lab environment. Here are some details
about my setup and the steps to reproduce the bug:
NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
Also reproduced on:
- Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
- Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Kernel: 6.15.0-rc1 (built from mainline)
# ethtool -i ixgbe-x520-1
driver: ixgbe
version: 6.15.0-rc1
firmware-version: 0x00012b2c, 1.3429.0
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
The two ports of the NIC (named "ixgbe-x520-1" and "ixgbe-x520-2") are directly
connected to each other using a DAC cable. Both ports are configured as slaves
of a bonding interface in mode balance-rr.
Neither the direct connection of the two ports nor the round-robin bonding mode
is required to reproduce the issue; this setup just makes it easier to reproduce
in an isolated environment. The issue is also visible with a regular 802.3ad
link aggregation with a switch on the other side.
# modprobe bonding
# ip link set dev ixgbe-x520-1 down
# ip link set dev ixgbe-x520-2 down
# ip link add bond0 type bond mode balance-rr
# ip link set dev ixgbe-x520-1 master bond0
# ip link set dev ixgbe-x520-2 master bond0
# ip link set dev ixgbe-x520-1 up
# ip link set dev ixgbe-x520-2 up
# ip link set dev bond0 up
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v6.15.0-rc1
Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
Slave Interface: ixgbe-x520-1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 6c:b3:11:08:5c:3c
Slave queue ID: 0
Slave Interface: ixgbe-x520-2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 6c:b3:11:08:5c:3e
Slave queue ID: 0
# ethtool -l ixgbe-x520-1
Channel parameters for ixgbe-x520-1:
Pre-set maximums:
RX: n/a
TX: n/a
Other: 1
Combined: 63
Current hardware settings:
RX: n/a
TX: n/a
Other: 1
Combined: 63
(same for ixgbe-x520-2)
The following steps use the xdp-tools from https://github.com/xdp-project/xdp-tools/.
Enable XDP on the bonding interface and make sure all received packets are dropped:
# xdp-tools/xdp-bench/xdp-bench drop -e -i 1 bond0
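
(The program xdp-bench attaches here essentially just counts and drops
everything. If building xdp-tools is inconvenient, a trivial XDP_DROP program
like the following sketch, which is not the code xdp-bench actually loads,
should serve the same purpose:)

/* Sketch of a trivial stand-in for "xdp-bench drop": unconditionally drop
 * every frame received on bond0.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
        return XDP_DROP;
}

char _license[] SEC("license") = "GPL";
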
Redirect a batch of packets to the bonding interface:
# xdp-tools/xdp-trafficgen/xdp-trafficgen udp --dst-mac <mac of bond0> \
    --src-port 5000 --dst-port 6000 --threads 16 --num-packets 1000000 bond0
Shortly after that (3-4 seconds), one or more "Detected Tx Unit Hang" errors
(see above) will show up in the kernel log.
Neither the high packet count nor the thread count (--threads 16) is required
to trigger the issue, but they greatly increase the probability of hitting it.
Do you have any ideas about what may be causing this issue, or what I can do
to diagnose it further?
Please let me know if I should provide any more information.
Thanks!
Marcus