[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z/fWHYETBYQuCno5@localhost.localdomain>
Date: Thu, 10 Apr 2025 16:30:53 +0200
From: Michal Kubiak <michal.kubiak@...el.com>
To: Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de>
CC: Tony Nguyen <anthony.l.nguyen@...el.com>, Jay Vosburgh <jv@...sburgh.net>,
Przemek Kitszel <przemyslaw.kitszel@...el.com>, Andrew Lunn
<andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>, "Eric
Dumazet" <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
<pabeni@...hat.com>, Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann
<daniel@...earbox.net>, Jesper Dangaard Brouer <hawk@...nel.org>, "John
Fastabend" <john.fastabend@...il.com>, <intel-wired-lan@...ts.osuosl.org>,
<netdev@...r.kernel.org>, <bpf@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <sdn@...zner-cloud.de>
Subject: Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)
On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> Hi,
>
> in a setup where I use native XDP to redirect packets to a bonding interface
> that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
> resets the NIC with the following kernel output:
>
> ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
> Tx Queue <4>
> TDH, TDT <17e>, <17e>
> next_to_use <181>
> next_to_clean <17e>
> tx_buffer_info[next_to_clean]
> time_stamp <0>
> jiffies <10025c380>
> ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
> ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
> ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
>
> This only occurs in combination with a bonding interface and XDP, so I don't
> know if this is an issue with ixgbe or the bonding driver.
> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
> show the same issue.
>
>
> I managed to reproduce this bug in a lab environment. Here are some details
> about my setup and the steps to reproduce the bug:
>
> NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
>
> CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
> Also reproduced on:
> - Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
> - Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>
> Kernel: 6.15.0-rc1 (built from mainline)
>
> # ethtool -i ixgbe-x520-1
> driver: ixgbe
> version: 6.15.0-rc1
> firmware-version: 0x00012b2c, 1.3429.0
> expansion-rom-version:
> bus-info: 0000:01:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
> The two ports of the NIC (named "ixgbe-x520-1" and "ixgbe-x520-2") are directly
> connected with each other using a DAC cable. Both ports are configured to be
> slaves of a bonding with mode balance-rr.
> Neither the direct connection of both ports nor the round-robin bonding mode
> are a requirement to reproduce the issue. This setup just allows it to be easier
> reproduced in an isolated environment. The issue is also visible with a regular
> 802.3ad link aggregation with a switch on the other side.
>
> # modprobe bonding
> # ip link set dev ixgbe-x520-1 down
> # ip link set dev ixgbe-x520-2 down
> # ip link add bond0 type bond mode balance-rr
> # ip link set dev ixgbe-x520-1 master bond0
> # ip link set dev ixgbe-x520-2 master bond0
> # ip link set dev ixgbe-x520-1 up
> # ip link set dev ixgbe-x520-2 up
> # ip link set dev bond0 up
>
> # cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v6.15.0-rc1
>
> Bonding Mode: load balancing (round-robin)
> MII Status: up
> MII Polling Interval (ms): 0
> Up Delay (ms): 0
> Down Delay (ms): 0
> Peer Notification Delay (ms): 0
>
> Slave Interface: ixgbe-x520-1
> MII Status: up
> Speed: 10000 Mbps
> Duplex: full
> Link Failure Count: 0
> Permanent HW addr: 6c:b3:11:08:5c:3c
> Slave queue ID: 0
>
> Slave Interface: ixgbe-x520-2
> MII Status: up
> Speed: 10000 Mbps
> Duplex: full
> Link Failure Count: 0
> Permanent HW addr: 6c:b3:11:08:5c:3e
> Slave queue ID: 0
>
> # ethtool -l ixgbe-x520-1
> Channel parameters for ixgbe-x520-1:
> Pre-set maximums:
> RX: n/a
> TX: n/a
> Other: 1
> Combined: 63
> Current hardware settings:
> RX: n/a
> TX: n/a
> Other: 1
> Combined: 63
> (same for ixgbe-x520-2)
>
> In the following the xdp-tools from https://github.com/xdp-project/xdp-tools/
> are used.
>
> Enable XDP on the bonding and make sure all received packets will be dropped:
> # xdp-tools/xdp-bench/xdp-bench drop -e -i 1 bond0
>
> Redirect a batch of packets to the bonding interface:
> # xdp-tools/xdp-trafficgen/xdp-trafficgen udp --dst-mac <mac of bond0>
> --src-port 5000 --dst-port 6000 --threads 16 --num-packets 1000000 bond0
>
> Shortly after that (3-4 seconds), one or more "Detected Tx Unit Hang" errors
> (see above) will show up in the kernel log.
>
> The high number of packets and thread count (--threads 16) is not required to
> trigger the issue but greatly improves the probability.
>
>
> Do you have any ideas what may be causing this issue or what I can do to
> diagnose this further?
>
> Please let me know when I should provide any more information.
>
>
> Thanks!
> Marcus
>
Hi Marcus,
Thank you for reporting this issue!
I have just successfully reproduced the problem on our lab machine. What
is interesting is that I do not seem to have to use a bonding interface
to get the "Tx timeout" that causes the adapter to reset.
I will try to debug the problem more closely and let you know of any
updates.
Thanks,
Michal
Powered by blists - more mailing lists