Message-ID: <7e67776b-84d7-4976-8a97-82473cd63b06@kernel.org>
Date: Mon, 7 Apr 2025 14:42:37 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Toke Høiland-Jørgensen <toke@...hat.com>,
netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>
Cc: bpf@...r.kernel.org, tom@...bertland.com,
Eric Dumazet <eric.dumazet@...il.com>, "David S. Miller"
<davem@...emloft.net>, Paolo Abeni <pabeni@...hat.com>,
kernel-team@...udflare.com, makita.toshiaki@....ntt.co.jp,
Yan Zhai <yan@...udflare.com>, Jesse Brandeburg
<jbrandeburg@...udflare.com>, "Michael S. Tsirkin" <mst@...hat.com>
Subject: Re: [RFC PATCH net-next] veth: apply qdisc backpressure on full
ptr_ring to reduce TX drops
On 07/04/2025 11.15, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@...nel.org> writes:
>
>> In production, we're seeing TX drops on veth devices when the ptr_ring
>> fills up. This can occur when NAPI mode is enabled, though it's
>> relatively rare. However, with threaded NAPI - which we use in
>> production - the drops become significantly more frequent.
>>
>> The underlying issue is that with threaded NAPI, the consumer often runs
>> on a different CPU than the producer. This increases the likelihood of
>> the ring filling up before the consumer gets scheduled, especially under
>> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>>
>> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
>> ring is full, signaling the qdisc layer to requeue the packet. The txq
>> (netdev queue) is stopped in this condition and restarted once
>> veth_poll() drains entries from the ring, ensuring coordination between
>> NAPI and qdisc.
>
> Right, I definitely agree that this is the right solution; having no
> backpressure and a fixed-size ringbuffer is obviously not ideal.
>
>> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
>> the driver retains its original behavior - dropping packets immediately
>> when the ring is full. This avoids unexpected behavior changes in setups
>> without a configured qdisc.
>
> Not sure I like this bit, though; see below.
>
I agree, but the reason for being chicken here is my prod deployment
plan. I'm about to roll this change out to production, and it is
practical to be able to roll back simply by removing the qdisc.

In the future, we (the netdev community) should consider changing veth
to automatically get the default qdisc attached when enabling NAPI
mode.
I will propose this when I have more data from production.
The ptr_ring overflow case is very easy to reproduce, even on testlab
servers without any competing traffic.
Together with Yan (Cc'ed), I have a reproducer script available here:
https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_setup01_NAPI_TX_drops.sh
>> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
>> (AQM) to fairly schedule packets across flows and reduce collateral
>> damage from elephant flows.
>>
>> A known limitation of this approach is that the full ring sits in front
>> of the qdisc layer, effectively forming a FIFO buffer that introduces
>> base latency. While AQM still improves fairness and mitigates flow
>> dominance, the latency impact is measurable.
>>
>> In hardware drivers, this issue is typically addressed using BQL (Byte
>> Queue Limits), which tracks in-flight bytes needed based on physical link
>> rate. However, for virtual drivers like veth, there is no fixed bandwidth
>> constraint - the bottleneck is CPU availability and the scheduler's ability
>> to run the NAPI thread. It is unclear how effective BQL would be in this
>> context.
>
> So the BQL algorithm tries to tune the maximum number of outstanding
> bytes to be ~twice the maximum that can be completed in one batch. Since
> we're not really limited by bytes in the same sense here (as you point
> out), an approximate equivalent would be the NAPI budget, I guess? I.e.,
> as a first approximation, we could have veth stop the queue once the
> ringbuffer has 2x the NAPI budget packets in it?
>
I like the idea. (Note: the current ptr_ring size is 256.)

To implement this, we could simply reduce the ptr_ring size to 128
(2*64), right?

IMHO it would be cooler to keep a larger ring, but dynamically stop the
txq once the ring holds 2x the NAPI budget worth of packets. But...
There are two subtle issues with the ptr_ring (Cc MST). (So perhaps we
need to choose another ring buffer API for veth?)

(1) We don't have a counter for how many elements are in the queue.
Most hardware drivers try to stop the TXQ *before* their ring buffer
runs full, so that they avoid returning NETDEV_TX_BUSY.
(This is why it is hard to implement BQL for veth.)

(2) The ptr_ring consumer delays making slots available to the producer
again, to avoid bouncing cache lines in the ring-full case. Thus, we
cannot make the ptr_ring too small, and I'm unsure how this interacts
with the idea of keeping two NAPI budgets worth of headroom.
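To illustrate (1): one option is to maintain our own occupancy counter
next to the ptr_ring, since the ring itself doesn't expose one. Very
rough, untested sketch - the names (ring_count, veth_txq_maybe_stop/wake,
VETH_TXQ_STOP_THRESH) are made up; only NAPI_POLL_WEIGHT (the NAPI
budget, 64) is real:

#define VETH_TXQ_STOP_THRESH	(2 * NAPI_POLL_WEIGHT)	/* 2x NAPI budget */

/* Producer side (veth_xmit), after a successful ptr_ring_produce().
 * The atomic_t would live next to the ptr_ring (hypothetical field).
 */
static void veth_txq_maybe_stop(struct netdev_queue *txq,
				atomic_t *ring_count)
{
	if (atomic_inc_return(ring_count) >= VETH_TXQ_STOP_THRESH)
		netif_tx_stop_queue(txq);
}

/* Consumer side (veth_poll), after it consumed @done packets */
static void veth_txq_maybe_wake(struct netdev_queue *txq,
				atomic_t *ring_count, int done)
{
	if (atomic_sub_return(done, ring_count) <= VETH_TXQ_STOP_THRESH / 2 &&
	    netif_tx_queue_stopped(txq))
		netif_tx_wake_queue(txq);
}

Of course, a shared atomic like this reintroduces exactly the
producer/consumer cache-line bouncing that ptr_ring is designed to
avoid (issue (2)), so treat it as an illustration of the stop/wake
thresholds, not as a claim about the right data structure.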
>> This patch serves as a first step toward addressing TX drops. Future work
>> may explore adapting a BQL-like mechanism to better suit virtual devices
>> like veth.
>>
>> Reported-by: Yan Zhai <yan@...udflare.com>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@...nel.org>
>
> [...]
>
>> +/* Does specific txq have a real qdisc attached? - see noqueue_init() */
>> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
>> +{
>> + struct Qdisc *q;
>> +
>> + q = rcu_dereference(txq->qdisc);
>> + if (q->enqueue)
>> + return true;
>> + else
>> + return false;
>> +}
>
> This seems like a pretty ugly layering violation, inspecting the qdisc
> like this in the driver?
>
I agree. (At minimum it should be moved to include/net/sch_generic.h.)
I did consider using qdisc_tx_is_noop(), defined in
include/net/sch_generic.h, but IMHO it is too slow for (fast-)datapath
code as it walks all the TXQs.
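What I have in mind is a per-txq variant in include/net/sch_generic.h,
essentially the helper from this patch moved there (the name below is
just a placeholder):

/* Caller must hold the RCU read lock, which is the case on the
 * ndo_start_xmit() path. The "noqueue" qdisc has a NULL ->enqueue,
 * see noqueue_init().
 */
static inline bool qdisc_txq_has_enqueue(struct netdev_queue *txq)
{
	struct Qdisc *q = rcu_dereference(txq->qdisc);

	return !!q->enqueue;
}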
(more below)
> AFAICT, __dev_queue_xmit() turns a stopped queue into drops anyway, but
> emits a warning (looks like this, around line 4640 in dev.c):
>
> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> dev->name);
>
> As this patch shows, it can clearly be appropriate for a virtual device
> to stop the queue even if there's no qdisc, so how about we just get rid
> of that warning? Then this logic won't be needed at all in the driver..
>
Yes, the real reason for the layering violation is to avoid this
warning in __dev_queue_xmit().

NETDEV_TX_BUSY is (correctly) converted into a drop when the net_device
TXQ doesn't have a qdisc attached. Thus, we don't need the layering
violation in the driver if we can remove this warning.
Does anyone object to removing this net_crit_ratelimited?
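For concreteness, I'm thinking of nothing more than this (untested
sketch; the removed lines are the ones Toke quoted from around line
4640 in net/core/dev.c):

-			net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
-					     dev->name);
+			/* Virtual devices may stop their txq to apply
+			 * backpressure; the skb is still dropped below,
+			 * so no warning is needed.
+			 */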
--Jesper