linux-kernel - Re: [PATCH net V1 3/3] veth: more robust handing of race to avoid txq getting stuck

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b021f5c3-5105-445d-b919-8282363a19fc@kernel.org>
Date: Mon, 27 Oct 2025 13:19:32 +0100
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Toke Høiland-Jørgensen <toke@...hat.com>,
 netdev@...r.kernel.org, makita.toshiaki@....ntt.co.jp
Cc: Eric Dumazet <eric.dumazet@...il.com>,
 "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, ihor.solodrai@...ux.dev,
 toshiaki.makita1@...il.com, bpf@...r.kernel.org,
 linux-kernel@...r.kernel.org, kernel-team@...udflare.com
Subject: Re: [PATCH net V1 3/3] veth: more robust handing of race to avoid txq
 getting stuck




On 24/10/2025 16.33, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@...nel.org> writes:
> 
>> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
>> reduce TX drops") introduced a race condition that can lead to a permanently
>> stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
>> Max).
>>
>> The race occurs in veth_xmit(). The producer observes a full ptr_ring and
>> stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
>> intended to re-wake the queue if the consumer had just emptied it (if
>> (__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
>> "lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
>> traffic halts.
>>
>> This failure is caused by an incorrect use of the __ptr_ring_empty() API
>> from the producer side. As noted in kernel comments, this check is not
>> guaranteed to be correct if a consumer is operating on another CPU. The
>> empty test is based on ptr_ring->consumer_head, making it reliable only for
>> the consumer. Using this check from the producer side is fundamentally racy.
>>
>> This patch fixes the race by adopting the more robust logic from an earlier
>> version V4 of the patchset, which always flushed the peer:
>>
>> (1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
>> are removed. Instead, after stopping the queue, we unconditionally call
>> __veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
>> making it solely responsible for re-waking the TXQ.
> 
> This makes sense.
> 
>> (2) On the consumer side, the logic for waking the peer TXQ is centralized.
>> It is moved out of veth_xdp_rcv() (which processes a batch) and placed at
>> the end of the veth_poll() function. This ensures netif_tx_wake_queue() is
>> called once per complete NAPI poll cycle.
> 
> So is this second point strictly necessary to fix the race, or is it
> more of an optimisation?
> 

IMHO it is strictly necessary to fix the race.  The wakeup check
netif_tx_queue_stopped() in veth_poll() needs to be after the code that
(potentially) writes rx_notify_masked.

This handles the race where veth_xmit() haven't called
netif_tx_stop_queue() yet, but veth_poll() manage to consume all packets
and stopped NAPI.  Then we know that __veth_xdp_flush(rq) in veth_xmit()
will see rx_notify_masked==false and start NAPI/veth_poll() again, and
even-though there is no packets left to process we still hit the check
netif_tx_queue_stopped() which start txq and will allow veth_xmit() to
run again.

I'll see if I can improve the description for (2).

>> (3) Finally, the NAPI completion check in veth_poll() is updated. If NAPI is
>> about to complete (napi_complete_done), it now also checks if the peer TXQ
>> is stopped. If the ring is empty but the peer TXQ is stopped, NAPI will
>> reschedule itself. This prevents a new race where the producer stops the
>> queue just as the consumer is finishing its poll, ensuring the wakeup is not
>> missed.
> 
> Also makes sense...
> 
>> Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
>> Signed-off-by: Jesper Dangaard Brouer <hawk@...nel.org>
>> ---
>>   drivers/net/veth.c |   42 +++++++++++++++++++++---------------------
>>   1 file changed, 21 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 3976ddda5fb8..1d70377481eb 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -392,14 +392,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>>   		}
>>   		/* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
>>   		__skb_push(skb, ETH_HLEN);
>> -		/* Depend on prior success packets started NAPI consumer via
>> -		 * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
>> -		 * paired with empty check in veth_poll().
>> -		 */
>>   		netif_tx_stop_queue(txq);
>> -		smp_mb__after_atomic();
>> -		if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
>> -			netif_tx_wake_queue(txq);
>> +		/* Handle race: Makes sure NAPI peer consumer runs. Consumer is
>> +		 * responsible for starting txq again, until then ndo_start_xmit
>> +		 * (this function) will not be invoked by the netstack again.
>> +		 */
>> +		__veth_xdp_flush(rq);
> 
> Nit: I'd lose the "Handle race:" prefix from the comment; the rest of
> the comment is clear enough without it, and since there's no explanation
> of *which* race is being handled, it is just confusing, IMO.

Good point, I will remove prefix.

>>   		break;
>>   	case NET_RX_DROP: /* same as NET_XMIT_DROP */
>>   drop:
>> @@ -900,17 +898,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>   			struct veth_xdp_tx_bq *bq,
>>   			struct veth_stats *stats)
>>   {
>> -	struct veth_priv *priv = netdev_priv(rq->dev);
>> -	int queue_idx = rq->xdp_rxq.queue_index;
>> -	struct netdev_queue *peer_txq;
>> -	struct net_device *peer_dev;
>>   	int i, done = 0, n_xdpf = 0;
>>   	void *xdpf[VETH_XDP_BATCH];
>>   
>> -	/* NAPI functions as RCU section */
>> -	peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>> -	peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>> -
>>   	for (i = 0; i < budget; i++) {
>>   		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
>>   
>> @@ -959,11 +949,6 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>   	rq->stats.vs.xdp_packets += done;
>>   	u64_stats_update_end(&rq->stats.syncp);
>>   
>> -	if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq))) {
>> -		txq_trans_cond_update(peer_txq);
>> -		netif_tx_wake_queue(peer_txq);
>> -	}
>> -
>>   	return done;
>>   }
>>   
>> @@ -971,12 +956,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
>>   {
>>   	struct veth_rq *rq =
>>   		container_of(napi, struct veth_rq, xdp_napi);
>> +	struct veth_priv *priv = netdev_priv(rq->dev);
>> +	int queue_idx = rq->xdp_rxq.queue_index;
>> +	struct netdev_queue *peer_txq;
>>   	struct veth_stats stats = {};
>> +	struct net_device *peer_dev;
>>   	struct veth_xdp_tx_bq bq;
>>   	int done;
>>   
>>   	bq.count = 0;
>>   
>> +	/* NAPI functions as RCU section */
>> +	peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
>> +	peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>> +
>>   	xdp_set_return_frame_no_direct();
>>   	done = veth_xdp_rcv(rq, budget, &bq, &stats);
>>   
>> @@ -986,7 +979,8 @@ static int veth_poll(struct napi_struct *napi, int budget)
>>   	if (done < budget && napi_complete_done(napi, done)) {
>>   		/* Write rx_notify_masked before reading ptr_ring */
>>   		smp_store_mb(rq->rx_notify_masked, false);
>> -		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>> +		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring) ||
>> +			     (peer_txq && netif_tx_queue_stopped(peer_txq)))) {
>>   			if (napi_schedule_prep(&rq->xdp_napi)) {
>>   				WRITE_ONCE(rq->rx_notify_masked, true);
>>   				__napi_schedule(&rq->xdp_napi);
>> @@ -998,6 +992,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
>>   		veth_xdp_flush(rq, &bq);
>>   	xdp_clear_return_frame_no_direct();
>>   
>> +	/* Release backpressure per NAPI poll */
>> +	if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
>> +		txq_trans_cond_update(peer_txq);
>> +		netif_tx_wake_queue(peer_txq);
>> +	}
>> +
>>   	return done;
>>   }
>>   
>