Message-ID: <d33f0ab4-4dc4-49cd-bbd0-055f58dd6758@hetzner-cloud.de>
Date: Wed, 9 Apr 2025 17:17:49 +0200
From: Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de>
To: Tony Nguyen <anthony.l.nguyen@...el.com>, Jay Vosburgh
 <jv@...sburgh.net>, Przemek Kitszel <przemyslaw.kitszel@...el.com>,
 Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
 Jesper Dangaard Brouer <hawk@...nel.org>,
 John Fastabend <john.fastabend@...il.com>
Cc: intel-wired-lan@...ts.osuosl.org, netdev@...r.kernel.org,
 bpf@...r.kernel.org, linux-kernel@...r.kernel.org, sdn@...zner-cloud.de
Subject: [BUG] ixgbe: Detected Tx Unit Hang (XDP)

Hi,

In a setup where I use native XDP to redirect packets to a bonding interface
backed by two ixgbe slaves, I noticed that the ixgbe driver constantly resets
the NIC, logging the following kernel output:

  ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
    Tx Queue             <4>
    TDH, TDT             <17e>, <17e>
    next_to_use          <181>
    next_to_clean        <17e>
  tx_buffer_info[next_to_clean]
    time_stamp           <0>
    jiffies              <10025c380>
  ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
  ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
  ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
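
As far as I understand it, this message comes from a Tx watchdog that, roughly
speaking, declares a hang when descriptors are still queued on the ring but the
hardware head pointer (TDH) has not advanced between two checks. A simplified
illustration of that logic (the names are made up, this is not the actual ixgbe
code):

  #include <stdbool.h>

  struct tx_ring_state {
      unsigned int tdh;           /* hardware head pointer (TDH register) */
      unsigned int tdt;           /* hardware tail pointer (TDT register) */
      unsigned int next_to_use;   /* software producer index */
      unsigned int next_to_clean; /* software consumer index */
      unsigned int tdh_old;       /* TDH value seen at the previous check */
  };

  static bool tx_ring_looks_hung(struct tx_ring_state *ring)
  {
      bool pending     = ring->next_to_use != ring->next_to_clean;
      bool no_progress = ring->tdh == ring->tdh_old;

      ring->tdh_old = ring->tdh;

      /* Work is still queued but the hardware made no progress since
       * the previous watchdog pass -> report "Detected Tx Unit Hang".
       */
      return pending && no_progress;
  }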

This only occurs in combination with a bonding interface and XDP, so I don't
know whether the issue is in ixgbe or in the bonding driver.
I first discovered this with Linux 6.8.0-57, but kernels 6.14.0 and 6.15.0-rc1
show the same issue.
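
For context, the XDP part of the setup boils down to a program that redirects
every received frame to the bond via bpf_redirect(). The following is only a
minimal sketch (the hard-coded ifindex and the program name are illustrative,
not my actual production program):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* ifindex of bond0; illustrative, resolve it dynamically in practice */
  #define BOND_IFINDEX 4

  SEC("xdp")
  int xdp_redirect_to_bond(struct xdp_md *ctx)
  {
      /* Queue the frame for transmission on bond0; the bonding driver
       * forwards it to one of the slaves' XDP Tx rings.
       */
      return bpf_redirect(BOND_IFINDEX, 0);
  }

  char _license[] SEC("license") = "GPL";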


I managed to reproduce this bug in a lab environment. Here are some details
about my setup and the steps to reproduce it:

NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
     Also reproduced on:
     - Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
     - Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

Kernel: 6.15.0-rc1 (built from mainline)

  # ethtool -i ixgbe-x520-1
  driver: ixgbe
  version: 6.15.0-rc1
  firmware-version: 0x00012b2c, 1.3429.0
  expansion-rom-version: 
  bus-info: 0000:01:00.0
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: yes
  supports-register-dump: yes
  supports-priv-flags: yes

The two ports of the NIC (named "ixgbe-x520-1" and "ixgbe-x520-2") are directly
connected to each other using a DAC cable. Both ports are configured as slaves
of a bond in mode balance-rr.
Neither the direct connection of both ports nor the round-robin bonding mode is
required to reproduce the issue; this setup just makes it easier to reproduce in
an isolated environment. The issue is also visible with a regular 802.3ad link
aggregation with a switch on the other side.

  # modprobe bonding
  # ip link set dev ixgbe-x520-1 down
  # ip link set dev ixgbe-x520-2 down
  # ip link add bond0 type bond mode balance-rr
  # ip link set dev ixgbe-x520-1 master bond0
  # ip link set dev ixgbe-x520-2 master bond0
  # ip link set dev ixgbe-x520-1 up
  # ip link set dev ixgbe-x520-2 up
  # ip link set dev bond0 up
        
  # cat /proc/net/bonding/bond0
  Ethernet Channel Bonding Driver: v6.15.0-rc1

  Bonding Mode: load balancing (round-robin)
  MII Status: up
  MII Polling Interval (ms): 0
  Up Delay (ms): 0
  Down Delay (ms): 0
  Peer Notification Delay (ms): 0

  Slave Interface: ixgbe-x520-1
  MII Status: up
  Speed: 10000 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: 6c:b3:11:08:5c:3c
  Slave queue ID: 0

  Slave Interface: ixgbe-x520-2
  MII Status: up
  Speed: 10000 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: 6c:b3:11:08:5c:3e
  Slave queue ID: 0

  # ethtool -l ixgbe-x520-1
  Channel parameters for ixgbe-x520-1:
  Pre-set maximums:
  RX:             n/a
  TX:             n/a
  Other:          1
  Combined:       63
  Current hardware settings:
  RX:             n/a
  TX:             n/a
  Other:          1
  Combined:       63
  (same for ixgbe-x520-2)

The following steps use the xdp-tools from
https://github.com/xdp-project/xdp-tools/.

Enable XDP on the bonding interface and make sure all received packets are dropped:
  # xdp-tools/xdp-bench/xdp-bench drop -e -i 1 bond0

Redirect a batch of packets to the bonding interface:
  # xdp-tools/xdp-trafficgen/xdp-trafficgen udp --dst-mac <mac of bond0> \
    --src-port 5000 --dst-port 6000 --threads 16 --num-packets 1000000 bond0

Shortly after that (3-4 seconds), one or more "Detected Tx Unit Hang" errors
(see above) will show up in the kernel log.

The high packet count and thread count (--threads 16) are not required to
trigger the issue, but they greatly increase the probability of hitting it.


Do you have any idea what might be causing this issue, or what I can do to
diagnose it further?

Please let me know if I should provide any more information.


Thanks!
Marcus
