netdev - Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)

Message-ID: <eca1880f-253a-4955-afe6-732d7c6926ee@hetzner-cloud.de>
Date: Thu, 24 Apr 2025 12:19:35 +0200
From: Tobias Böhm <tobias.boehm@...zner-cloud.de>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
 Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de>
Cc: Michal Kubiak <michal.kubiak@...el.com>,
 Tony Nguyen <anthony.l.nguyen@...el.com>, Jay Vosburgh <jv@...sburgh.net>,
 Przemek Kitszel <przemyslaw.kitszel@...el.com>,
 Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
 Jesper Dangaard Brouer <hawk@...nel.org>,
 John Fastabend <john.fastabend@...il.com>, intel-wired-lan@...ts.osuosl.org,
 netdev@...r.kernel.org, bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
 sdn@...zner-cloud.de
Subject: Re: [BUG] ixgbe: Detected Tx Unit Hang (XDP)

On 23.04.25 at 20:39, Maciej Fijalkowski wrote:
> On Wed, Apr 23, 2025 at 04:20:07PM +0200, Marcus Wichelmann wrote:
>> On 17.04.25 at 16:47, Maciej Fijalkowski wrote:
>>> On Fri, Apr 11, 2025 at 10:14:57AM +0200, Michal Kubiak wrote:
>>>> On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
>>>>> On 10.04.25 at 16:30, Michal Kubiak wrote:
>>>>>> On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> in a setup where I use native XDP to redirect packets to a bonding interface
>>>>>>> that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
>>>>>>> resets the NIC with the following kernel output:
>>>>>>>
>>>>>>>    ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
>>>>>>>      Tx Queue             <4>
>>>>>>>      TDH, TDT             <17e>, <17e>
>>>>>>>      next_to_use          <181>
>>>>>>>      next_to_clean        <17e>
>>>>>>>    tx_buffer_info[next_to_clean]
>>>>>>>      time_stamp           <0>
>>>>>>>      jiffies              <10025c380>
>>>>>>>    ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
>>>>>>>    ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
>>>>>>>    ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
>>>>>>>
>>>>>>> This only occurs in combination with a bonding interface and XDP, so I don't
>>>>>>> know if this is an issue with ixgbe or the bonding driver.
>>>>>>> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
>>>>>>> show the same issue.
>>>>>>>
>>>>>>>
>>>>>>> I managed to reproduce this bug in a lab environment. Here are some details
>>>>>>> about my setup and the steps to reproduce the bug:
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>> Do you have any ideas what may be causing this issue or what I can do to
>>>>>>> diagnose this further?
>>>>>>>
>>>>>>> Please let me know if I should provide any further information.
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Marcus
>>>>>>>
>>>>>>
>>>> [...]
>>>>
>>>> Hi Marcus,
>>>>
>>>>> thank you for looking into it. And not even 24 hours after my report, I'm
>>>>> very impressed! ;)
>>>>
>>>> Thanks! :-)
>>>>
>>>>> Interesting. I just tried again but had no luck yet with reproducing it
>>>>> without a bonding interface. May I ask what your setup looks like?
>>>>
>>>> For now, I've just grabbed the first available system with the HW
>>>> controlled by the "ixgbe" driver. In my case it was:
>>>>
>>>>    Ethernet controller: Intel Corporation Ethernet Controller X550
>>>>
>>>> Also, for my first attempt, I didn't use the upstream kernel - I just tried
>>>> the kernel installed on that system. It was the Fedora kernel:
>>>>
>>>>    6.12.8-200.fc41.x86_64
>>>>
>>>>
>>>> I think that may be the "beauty" of timing issues - sometimes you can change
>>>> just one piece in your system and get a completely different replication ratio.
>>>> Anyway, the higher the repro probability, the easier it is to debug
>>>> the timing problem. :-)
>>>
>>> Hi Marcus, to break the silence could you try to apply the diff below on
>>> your side?
>>
>> Hi, thank you for the patch. We've tried it and with your changes we can no
>> longer trigger the error and the NIC is no longer being reset.
>>
>>> We see several issues around XDP queues in ixgbe, but before we
>>> proceed let's try this small change on your side.
>>
>> How confident are you that this patch is sufficient to make things stable enough
>> for production use? Was it just the Tx hang detection that was misbehaving for
>> the XDP case, or is there an underlying issue with the XDP queues that is not
>> solved by disabling the detection for it?
> 
> I believe that the correct way to approach this is to move the Tx hang
> detection into ixgbe_tx_timeout(), as that is where this logic belongs.
> By doing so, I suppose we would kill two birds with one stone, as the
> mentioned ndo is called under the netdev watchdog, which does not apply
> to XDP Tx queues.
> 
>>
>> With our current setup we cannot accurately verify that we have no packet
>> loss or stuck queues. We can do additional tests to verify that.


Hi Maciej,

I'm a colleague of Marcus and am involved in the testing as well.

>>> Additional question, do you have enabled pause frames on your setup?
>>
>> Pause frames were enabled, but we can also reproduce it after disabling them,
>> without your patch.
> 
> Please give your setup a go with pause frames enabled and the patch that
> I shared previously applied, and let us see the results. As said above, I
> do not think it is correct to check for hung queues in the Tx descriptor
> cleaning routine. This is the job of the ndo_tx_timeout callback.
> 

We have tested with pause frames enabled and the patch applied, and can no
longer trigger the error in our lab setup.
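
As a rough illustration of the kind of check under discussion, here is a
minimal user-space sketch in C. It is not the ixgbe code and not the diff
from this thread. The struct, its fields and both function names are
hypothetical, and skipping XDP rings is only an assumption based on the
"disabling the detection" wording quoted above. The idea: a ring is
suspected hung when descriptors are pending but the completed-packet count
has not moved across two consecutive checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a Tx ring; not the driver's real structures. */
struct tx_ring_model {
    uint16_t next_to_use;   /* next descriptor software will fill */
    uint16_t next_to_clean; /* next descriptor expected to complete */
    uint16_t count;         /* number of descriptors in the ring */
    uint64_t tx_done;       /* packets completed so far */
    uint64_t tx_done_old;   /* tx_done as seen by the previous check */
    bool armed;             /* one check without progress already seen */
    bool is_xdp;            /* ring is used for XDP transmit */
};

/* Descriptors queued but not yet cleaned, accounting for wrap-around. */
static uint16_t tx_pending(const struct tx_ring_model *r)
{
    return (uint16_t)((r->next_to_use + r->count - r->next_to_clean)
                      % r->count);
}

/*
 * Returns true only on the second consecutive check that sees pending
 * work without progress. XDP rings are skipped here (assumption, see
 * the lead-in).
 */
static bool tx_hang_suspected(struct tx_ring_model *r)
{
    if (r->is_xdp)
        return false;

    if (r->tx_done == r->tx_done_old && tx_pending(r) != 0) {
        bool was_armed = r->armed;

        r->armed = true;
        return was_armed;
    }

    /* Progress was made or nothing is pending: reset the state. */
    r->tx_done_old = r->tx_done;
    r->armed = false;
    return false;
}
```

With the values from the original report (next_to_use 0x181, next_to_clean
0x17e), this model sees three pending descriptors. Without any completion
in between, it reports a hang on the second check for a regular ring and
never for a ring marked as XDP.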

>>
>> Thanks!
> 
> Thanks for the feedback and testing. I'll provide a proper fix tomorrow
> and CC you so you can take it for a spin.
> 

That sounds great. We'd be happy to test with the proper fix in our 
original setup.

Thanks,
Tobias
