netdev - Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)

Message-ID: <Z/jPgceDT4gRu9/R@localhost.localdomain>
Date: Fri, 11 Apr 2025 10:14:57 +0200
From: Michal Kubiak <michal.kubiak@...el.com>
To: Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de>
CC: Tony Nguyen <anthony.l.nguyen@...el.com>,
	Jay Vosburgh <jv@...sburgh.net>,
	Przemek Kitszel <przemyslaw.kitszel@...el.com>,
	Andrew Lunn <andrew+netdev@...n.ch>,
	"David S. Miller" <davem@...emloft.net>,
	"Eric Dumazet" <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>,
	Alexei Starovoitov <ast@...nel.org>,
	Daniel Borkmann <daniel@...earbox.net>,
	Jesper Dangaard Brouer <hawk@...nel.org>,
	"John Fastabend" <john.fastabend@...il.com>,
	<intel-wired-lan@...ts.osuosl.org>,
	<netdev@...r.kernel.org>, <bpf@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <sdn@...zner-cloud.de>
Subject: Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)

On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
> Am 10.04.25 um 16:30 schrieb Michal Kubiak:
> > On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> >> Hi,
> >>
> >> in a setup where I use native XDP to redirect packets to a bonding interface
> >> that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
> >> resets the NIC with the following kernel output:
> >>
> >>   ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
> >>     Tx Queue             <4>
> >>     TDH, TDT             <17e>, <17e>
> >>     next_to_use          <181>
> >>     next_to_clean        <17e>
> >>   tx_buffer_info[next_to_clean]
> >>     time_stamp           <0>
> >>     jiffies              <10025c380>
> >>   ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
> >>   ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
> >>   ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> >>
> >> This only occurs in combination with a bonding interface and XDP, so I don't
> >> know if this is an issue with ixgbe or the bonding driver.
> >> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
> >> show the same issue.
> >>
> >>
> >> I managed to reproduce this bug in a lab environment. Here are some details
> >> about my setup and the steps to reproduce the bug:
> >>
> >> [...]
> >>
> >> Do you have any ideas what may be causing this issue or what I can do to
> >> diagnose this further?
> >>
> >> Please let me know if I should provide any more information.
> >>
> >>
> >> Thanks!
> >> Marcus
> >>
> > 
> > Hi Marcus,
> 
> Hi Michal,
> 
> thank you for looking into it. And not even 24 hours after my report, I'm
> very impressed! ;)
> 
> > I have just successfully reproduced the problem on our lab machine. What
> > is interesting is that I do not seem to have to use a bonding interface
> > to get the "Tx timeout" that causes the adapter to reset.
> 
> Interesting. I just tried again but have had no luck yet reproducing it
> without a bonding interface. May I ask what your setup looks like?
> 
> > I will try to debug the problem more closely and let you know of any
> > updates.
> > 
> > Thanks,
> > Michal
> 
> Great!
> 
> Marcus
>

Hi Marcus,

> thank you for looking into it. And not even 24 hours after my report, I'm
> very impressed! ;)

Thanks! :-)

> Interesting. I just tried again but have had no luck yet reproducing it
> without a bonding interface. May I ask what your setup looks like?

For now, I've just grabbed the first available system with hardware
driven by the "ixgbe" driver. In my case it was:

  Ethernet controller: Intel Corporation Ethernet Controller X550

Also, for my first attempt, I didn't use the upstream kernel - I just tried
the kernel installed on that system. It was the Fedora kernel:

  6.12.8-200.fc41.x86_64


I think that may be the "beauty" of timing issues: sometimes changing just
one piece of the system gives a completely different reproduction rate.
Anyway, the higher the repro probability, the easier it is to debug
the timing problem. :-)
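
When comparing hang reports across setups, it may help to reduce each
report to a few numbers. Below is a minimal sketch, assuming the report
format from the log quoted earlier in this thread. The function names are
hypothetical (not part of ixgbe or any existing tool), and RING_SIZE = 512
is an assumption; the real value can be checked with "ethtool -g".

```python
import re

# Hang report copied from the log quoted earlier in this thread.
REPORT = """\
ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
  Tx Queue             <4>
  TDH, TDT             <17e>, <17e>
  next_to_use          <181>
  next_to_clean        <17e>
tx_buffer_info[next_to_clean]
  time_stamp           <0>
  jiffies              <10025c380>
"""

# Assumption: number of Tx descriptors in the ring (check with ethtool -g).
RING_SIZE = 512


def parse_hang_report(text):
    """Return the hex fields of a hang report as a dict of ints."""
    fields = {}
    for line in text.splitlines():
        values = re.findall(r"<([0-9a-fA-F]+)>", line)
        if not values:
            continue  # header lines carry no <...> values
        name = line.split("<", 1)[0].strip()
        if name == "TDH, TDT":
            fields["TDH"], fields["TDT"] = (int(v, 16) for v in values)
        else:
            fields[name] = int(values[0], 16)
    return fields


def pending_descriptors(fields, ring_size):
    """Descriptors queued by software but not yet cleaned (with wraparound)."""
    return (fields["next_to_use"] - fields["next_to_clean"]) % ring_size


fields = parse_hang_report(REPORT)
pending = pending_descriptors(fields, RING_SIZE)
print("queue:", fields["Tx Queue"])
print("TDH == TDT:", fields["TDH"] == fields["TDT"])
print("pending descriptors:", pending)
```

For the report above, TDH and TDT are equal while next_to_use is three
entries ahead of next_to_clean. That is only an observation taken from the
numbers in the log, not a diagnosis of the root cause.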

Thanks,
Michal