Message-ID: <871pmsfjye.fsf@toke.dk>
Date: Fri, 24 Oct 2025 16:33:29 +0200
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>, netdev@...r.kernel.org,
makita.toshiaki@....ntt.co.jp
Cc: Jesper Dangaard Brouer <hawk@...nel.org>, Eric Dumazet
<eric.dumazet@...il.com>, "David S. Miller" <davem@...emloft.net>, Jakub
Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
ihor.solodrai@...ux.dev, toshiaki.makita1@...il.com, bpf@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-team@...udflare.com
Subject: Re: [PATCH net V1 3/3] veth: more robust handling of race to avoid
 txq getting stuck

Jesper Dangaard Brouer <hawk@...nel.org> writes:

> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") introduced a race condition that can lead to a permanently
> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
> Max).
>
> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
> intended to re-wake the queue if the consumer had just emptied it (if
> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
> traffic halts.
>
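
Just to check my understanding of the race, the problematic
interleaving is something like this (my reconstruction, simplified):

  producer (veth_xmit)              consumer (NAPI poll)
  --------------------              --------------------
  ptr_ring full
                                    drains the ring completely;
                                    TXQ not stopped yet, so no
                                    netif_tx_wake_queue()
  netif_tx_stop_queue(txq)
  __ptr_ring_empty() reads a stale
  consumer_head -> "not empty",
  so no netif_tx_wake_queue()

after which the ring stays empty, the consumer is never scheduled
again, and the TXQ is stuck in QUEUE_STATE_DRV_XOFF.
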
> This failure is caused by an incorrect use of the __ptr_ring_empty() API
> from the producer side. As noted in kernel comments, this check is not
> guaranteed to be correct if a consumer is operating on another CPU. The
> empty test is based on ptr_ring->consumer_head, making it reliable only for
> the consumer. Using this check from the producer side is fundamentally racy.
>
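
For reference, __ptr_ring_empty() is essentially this (paraphrasing
include/linux/ptr_ring.h, so modulo details):

  static inline bool __ptr_ring_empty(struct ptr_ring *r)
  {
          if (likely(r->size))
                  return !r->queue[READ_ONCE(r->consumer_head)];
          return true;
  }

Only the consumer ever advances consumer_head, and nothing orders the
producer's reads of consumer_head and the slot against the consumer's
updates, so on a weakly ordered machine the producer can see a stale
non-NULL slot even after the ring has been drained.
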
> This patch fixes the race by adopting the more robust logic from an
> earlier version (V4) of this patchset, which always flushed the peer:
>
> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
> are removed. Instead, after stopping the queue, we unconditionally call
> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
> making it solely responsible for re-waking the TXQ.

This makes sense.
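
It helps that __veth_xdp_flush() already does the right thing here; it
is essentially (again paraphrasing drivers/net/veth.c, so modulo
details):

  static void __veth_xdp_flush(struct veth_rq *rq)
  {
          /* Write ptr_ring before reading rx_notify_masked */
          smp_mb();
          if (!READ_ONCE(rq->rx_notify_masked) &&
              napi_schedule_prep(&rq->xdp_napi)) {
                  WRITE_ONCE(rq->rx_notify_masked, true);
                  __napi_schedule(&rq->xdp_napi);
          }
  }

i.e., after netif_tx_stop_queue() either this schedules NAPI directly,
or rx_notify_masked is true and a poll is already scheduled or running.
Either way a NAPI poll runs after the stop, and with the check you add
in (3) it cannot complete without noticing the stopped TXQ.
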
> (2) On the consumer side, the logic for waking the peer TXQ is centralized.
> It is moved out of veth_xdp_rcv() (which processes a batch) and placed at
> the end of the veth_poll() function. This ensures netif_tx_wake_queue() is
> called once per complete NAPI poll cycle.

So is this second point strictly necessary to fix the race, or is it
more of an optimisation?

> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
> about to complete (napi_complete_done), it now also checks if the peer TXQ
> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
> reschedule itself. This prevents a new race where the producer stops the
> queue just as the consumer is finishing its poll, ensuring the wakeup is not
> missed.

Also makes sense...
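
...specifically, AFAICT it closes this window (again my
reconstruction):

  producer (veth_xmit)              consumer (veth_poll)
  --------------------              --------------------
  ptr_ring full -> decide to stop
                                    drains the ring completely
  netif_tx_stop_queue(txq)
  __veth_xdp_flush(): no-op,
  rx_notify_masked still true
                                    napi_complete_done()
                                    smp_store_mb(rx_notify_masked,
                                                 false)
                                    ring empty, but peer TXQ stopped
                                    -> reschedule; the next poll
                                       wakes the TXQ

And if the consumer's clearing of rx_notify_masked wins the race
instead, the producer's __veth_xdp_flush() schedules NAPI itself, so
the wakeup cannot be lost either way.
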
> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
> Signed-off-by: Jesper Dangaard Brouer <hawk@...nel.org>
> ---
> drivers/net/veth.c | 42 +++++++++++++++++++++---------------------
> 1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 3976ddda5fb8..1d70377481eb 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
> }
> /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
> __skb_push(skb, ETH_HLEN);
> - /* Depend on prior success packets started NAPI consumer via
> - * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
> - * paired with empty check in veth_poll().
> - */
> netif_tx_stop_queue(txq);
> - smp_mb__after_atomic();
> - if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
> - netif_tx_wake_queue(txq);
> + /* Handle race: Makes sure NAPI peer consumer runs. Consumer is
> + * responsible for starting txq again, until then ndo_start_xmit
> + * (this function) will not be invoked by the netstack again.
> + */
> + __veth_xdp_flush(rq);

Nit: I'd lose the "Handle race:" prefix from the comment; the rest of
the comment is clear enough without it, and since there's no explanation
of *which* race is being handled, it is just confusing, IMO.

> break;
> case NET_RX_DROP: /* same as NET_XMIT_DROP */
> drop:
> @@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> struct veth_xdp_tx_bq *bq,
> struct veth_stats *stats)
> {
> - struct veth_priv *priv = netdev_priv(rq->dev);
> - int queue_idx = rq->xdp_rxq.queue_index;
> - struct netdev_queue *peer_txq;
> - struct net_device *peer_dev;
> int i, done = 0, n_xdpf = 0;
> void *xdpf[VETH_XDP_BATCH];
>
> - /* NAPI functions as RCU section */
> - peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
> - peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
> -
> for (i = 0; i < budget; i++) {
> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>
> @@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> rq->stats.vs.xdp_packets += done;
> u64_stats_update_end(&rq->stats.syncp);
>
> - if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
> - txq_trans_cond_update(peer_txq);
> - netif_tx_wake_queue(peer_txq);
> - }
> -
> return done;
> }
>
> @@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
> {
> struct veth_rq *rq =
> container_of(napi, struct veth_rq, xdp_napi);
> + struct veth_priv *priv = netdev_priv(rq->dev);
> + int queue_idx = rq->xdp_rxq.queue_index;
> + struct netdev_queue *peer_txq;
> struct veth_stats stats = {};
> + struct net_device *peer_dev;
> struct veth_xdp_tx_bq bq;
> int done;
>
> bq.count = 0;
>
> + /* NAPI functions as RCU section */
> + peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
> + peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
> +
> xdp_set_return_frame_no_direct();
> done = veth_xdp_rcv(rq, budget, &bq, &stats);
>
> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
> if (done < budget && napi_complete_done(napi, done)) {
> /* Write rx_notify_masked before reading ptr_ring */
> smp_store_mb(rq->rx_notify_masked, false);
> - if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> + if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
> + (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
> if (napi_schedule_prep(&rq->xdp_napi)) {
> WRITE_ONCE(rq->rx_notify_masked, true);
> __napi_schedule(&rq->xdp_napi);
> @@ -998,6 +992,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
> veth_xdp_flush(rq, &bq);
> xdp_clear_return_frame_no_direct();
>
> + /* Release backpressure per NAPI poll */
> + if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
> + txq_trans_cond_update(peer_txq);
> + netif_tx_wake_queue(peer_txq);
> + }
> +
> return done;
> }
>