linux-kernel - Re: [PATCH net V3 2/2] veth: more robust handing of race to avoid txq getting stuck

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4abd5327-ccb7-4dbc-9b09-e98069312e2f@gmail.com>
Date: Thu, 6 Nov 2025 23:14:11 +0900
From: Toshiaki Makita <toshiaki.makita1@...il.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Eric Dumazet <eric.dumazet@...il.com>,
 "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, ihor.solodrai@...ux.dev,
 "Michael S. Tsirkin" <mst@...hat.com>, makita.toshiaki@....ntt.co.jp,
 bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-arm-kernel@...ts.infradead.org, kernel-team@...udflare.com,
 Toke Høiland-Jørgensen <toke@...e.dk>,
 netdev@...r.kernel.org
Subject: Re: [PATCH net V3 2/2] veth: more robust handing of race to avoid txq
 getting stuck

On 2025/11/06 2:28, Jesper Dangaard Brouer wrote:
> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") introduced a race condition that can lead to a permanently
> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
> Max).
> 
> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
> intended to re-wake the queue if the consumer had just emptied it (if
> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
> traffic halts.
> 
> This failure is caused by an incorrect use of the __ptr_ring_empty() API
> from the producer side. As noted in kernel comments, this check is not
> guaranteed to be correct if a consumer is operating on another CPU. The
> empty test is based on ptr_ring->consumer_head, making it reliable only for
> the consumer. Using this check from the producer side is fundamentally racy.
> 
> This patch fixes the race by adopting the more robust logic from an earlier
> version V4 of the patchset, which always flushed the peer:
> 
> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
> are removed. Instead, after stopping the queue, we unconditionally call
> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
> making it solely responsible for re-waking the TXQ.
>    This handles the race where veth_poll() consumes all packets and completes
> NAPI *before* veth_xmit() on the producer side has called netif_tx_stop_queue.
> The __veth_xdp_flush(rq) will observe rx_notify_masked is false and schedule
> NAPI.
> 
> (2) On the consumer side, the logic for waking the peer TXQ is moved out of
> veth_xdp_rcv() and placed at the end of the veth_poll() function. This
> placement is part of fixing the race, as the netif_tx_queue_stopped() check
> must occur after rx_notify_masked is potentially set to false during NAPI
> completion.
>    This handles the race where veth_poll() consumes all packets, but haven't
> finished (rx_notify_masked is still true). The producer veth_xmit() stops the
> TXQ and __veth_xdp_flush(rq) will observe rx_notify_masked is true, meaning
> not starting NAPI.  Then veth_poll() change rx_notify_masked to false and
> stops NAPI.  Before exiting veth_poll() will observe TXQ is stopped and wake
> it up.
> 
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@...nel.org>

Reviewed-by: Toshiaki Makita <toshiaki.makita1@...il.com>