Message-ID: <b9f01e64-f7cc-4f5a-9716-5767b37e2245@kernel.org>
Date: Fri, 7 Nov 2025 14:42:58 +0100
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, Toke Høiland-Jørgensen
<toke@...e.dk>, Eric Dumazet <eric.dumazet@...il.com>,
"David S. Miller" <davem@...emloft.net>, Paolo Abeni <pabeni@...hat.com>,
ihor.solodrai@...ux.dev, "Michael S. Tsirkin" <mst@...hat.com>,
makita.toshiaki@....ntt.co.jp, toshiaki.makita1@...il.com,
bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, kernel-team@...udflare.com
Subject: Re: [PATCH net V3 1/2] veth: enable dev_watchdog for detecting
stalled TXQs
On 07/11/2025 02.29, Jakub Kicinski wrote:
> On Wed, 05 Nov 2025 18:28:12 +0100 Jesper Dangaard Brouer wrote:
>> The changes introduced in commit dc82a33297fc ("veth: apply qdisc
>> backpressure on full ptr_ring to reduce TX drops") have been found to cause
>> a race condition in production environments.
>>
>> Under specific circumstances, observed exclusively on ARM64 (aarch64)
>> systems with Ampere Altra Max CPUs, a transmit queue (TXQ) can become
>> permanently stalled. This happens when the race condition leads to the TXQ
>> entering the QUEUE_STATE_DRV_XOFF state without a corresponding queue wake-up,
>> preventing the attached qdisc from dequeueing packets and causing the
>> network link to halt.
>>
>> As a first step towards resolving this issue, this patch introduces a
>> failsafe mechanism. It enables the net device watchdog by setting a timeout
>> value and implements the .ndo_tx_timeout callback.
>>
>> If a TXQ stalls, the watchdog will trigger the veth_tx_timeout() function,
>> which logs a warning and calls netif_tx_wake_queue() to unstall the queue
>> and allow traffic to resume.
>>
>> The log message will look like this:
>>
>> veth42: NETDEV WATCHDOG: CPU: 34: transmit queue 0 timed out 5393 ms
>> veth42: veth backpressure stalled(n:1) TXQ(0) re-enable
>>
>> This provides a necessary recovery mechanism while the underlying race
>> condition is investigated further. Subsequent patches will address the root
>> cause and add more robust state handling.
>>
>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>> Reviewed-by: Toke Høiland-Jørgensen <toke@...hat.com>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@...nel.org>
>
> I think this belongs in net-next.. Fail safe is not really a bug fix.
> I'm slightly worried we're missing a corner case and will cause
> timeouts to get printed for someone's config.
>
This is a recovery fix. If the race condition fix isn't 100%, then this
patch will still allow veth to recover. Thus, to me it makes sense to
group these two patches together.

I'm more worried that we're missing a corner case that we cannot
recover from than about timeouts getting printed for someone's config.
That would require a config where the NAPI consumer veth_poll() takes
more than 5 seconds to run; with a budget of max 64 packets per poll,
that means the consumer is processing packets at a rate below 64/5 =
12.8 pps. It might be good to get some warnings if the system is
operating this slowly.

Also remember this is not the default config that most people use. The
code is only activated when attaching a qdisc to veth, which isn't the
default. Plus, NAPI mode needs to be activated, and in normal NAPI mode
the producer and consumer usually run on the same CPU, which makes it
impossible to overflow the ptr_ring. The veth backpressure is primarily
needed when running with threaded NAPI, where it is natural that the
producer and consumer run on different CPUs. In our production setup
the consumer is always slower than the producer (as the product inside
the namespace has installed too many nftables rules).
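
For reference, here is a simplified sketch of the generic TXQ stop/wake
pattern this backpressure relies on. This is NOT the actual veth code;
the names example_rq, example_xmit, example_select_rq and
example_poll_done are made up purely for illustration:

#include <linux/netdevice.h>
#include <linux/ptr_ring.h>
#include <linux/skbuff.h>

struct example_rq {
	struct ptr_ring xdp_ring;	/* hypothetical per-queue ring */
};

/* Hypothetical helper that maps a queue index to its ring */
static struct example_rq *example_select_rq(struct net_device *dev,
					    unsigned int qidx);

/* Producer side (ndo_start_xmit context, fed by the attached qdisc) */
static netdev_tx_t example_xmit(struct sk_buff *skb, struct net_device *dev)
{
	unsigned int qidx = skb_get_queue_mapping(skb);
	struct netdev_queue *txq = netdev_get_tx_queue(dev, qidx);
	struct example_rq *rq = example_select_rq(dev, qidx);

	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
		/* Ring full: stop this TXQ so the qdisc backs off.
		 * With producer and consumer on different CPUs
		 * (threaded NAPI), the consumer may drain the ring
		 * concurrently, so a matching wake-up on the
		 * consumer side is required.
		 */
		netif_tx_stop_queue(txq);
		return NETDEV_TX_BUSY;
	}
	return NETDEV_TX_OK;
}

/* Consumer side (NAPI poll / threaded NAPI), after draining the ring */
static void example_poll_done(struct example_rq *rq,
			      struct netdev_queue *txq)
{
	/* Re-enable the TXQ if the producer stopped it; a missed
	 * wake-up here is exactly what leaves the queue stalled in
	 * QUEUE_STATE_DRV_XOFF.
	 */
	if (netif_tx_queue_stopped(txq))
		netif_tx_wake_queue(txq);
}
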
>> +static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
>> +{
>> +	struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
>> +
>> +	netdev_err(dev, "veth backpressure stalled(n:%ld) TXQ(%u) re-enable\n",
>> +		   atomic_long_read(&txq->trans_timeout), txqueue);
>
> If you think the trans_timeout is useful, let's add it to the message
> core prints? And then we can make this msg just veth specific, I don't
> think we should be repeating what core already printed.
The trans_timeout is a counter for how many times this TXQ has seen a
timeout. It is practical as it directly tells us whether this is a
frequent event (without having to search the log files for similar
events).

It does make sense to add this to the core message ("NETDEV WATCHDOG")
by the same argument. For physical NICs these logs are present in
production. I'm looking at logs through Kibana (right now), and it
would make my life easier to see the number of times the individual
queues have experienced timeouts. The logs naturally get spaced in time
by the timeout, making it harder to tell the event frequency. Such a
patch would naturally go through net-next.
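
Something along these lines, as a rough sketch only (this is not the
exact code in net/sched/sch_generic.c, just an illustration of adding
the per-queue counter to the core watchdog message):

	/* i: TXQ index being checked by the watchdog loop */
	struct netdev_queue *txq = netdev_get_tx_queue(dev, i);

	netdev_warn(dev,
		    "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms (timeouts: %ld)\n",
		    raw_smp_processor_id(), i,
		    jiffies_to_msecs(jiffies - READ_ONCE(txq->trans_start)),
		    atomic_long_read(&txq->trans_timeout));
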
Do you still want me to remove the frequency counter from this message?
By the same argument, it is practical for me to have this as a single
log line when troubleshooting in practice. BTW, I've already backported
this watchdog patch to our prod kernel (without the race fix) and I'll
try to reproduce the race in staging/lab on some ARM64 servers. If I
reproduce it, it will be practical to have this counter.
--Jesper