linux-kernel - [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow control to avoid ptr

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9e586a3b-d7fa-4bba-b8d1-39f002e20913@tu-dortmund.de>
Date: Tue, 25 Nov 2025 15:04:44 +0100
From: Simon Schippers <simon.schippers@...dortmund.de>
To: Jason Wang <jasowang@...hat.com>
Cc: willemdebruijn.kernel@...il.com, andrew+netdev@...n.ch,
        davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
        pabeni@...hat.com, mst@...hat.com, eperezma@...hat.com,
        jon@...anix.com, tim.gebauer@...dortmund.de, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        virtualization@...ts.linux.dev
Subject: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow
 control to avoid ptr_ring tail drop

On 11/25/25 02:34, Jason Wang wrote:
> On Mon, Nov 24, 2025 at 5:20 PM Simon Schippers
> <simon.schippers@...dortmund.de> wrote:
>>
>> On 11/24/25 02:04, Jason Wang wrote:
>>> On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
>>> <simon.schippers@...dortmund.de> wrote:
>>>>
>>>> On 11/21/25 07:19, Jason Wang wrote:
>>>>> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
>>>>> <simon.schippers@...dortmund.de> wrote:
>>>>>>
>>>>>> This patch series deals with tun/tap and vhost-net which drop incoming
>>>>>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
>>>>>> patch series, the associated netdev queue is stopped before this happens.
>>>>>> This allows the connected qdisc to function correctly as reported by [1]
>>>>>> and improves application-layer performance, see our paper [2]. Meanwhile
>>>>>> the theoretical performance differs only slightly:
>>>>>>
>>>>>> +--------------------------------+-----------+----------+
>>>>>> | pktgen benchmarks to Debian VM | Stock     | Patched  |
>>>>>> | i5 6300HQ, 20M packets         |           |          |
>>>>>> +-----------------+--------------+-----------+----------+
>>>>>> | TAP             | Transmitted  | 195 Kpps  | 183 Kpps |
>>>>>> |                 +--------------+-----------+----------+
>>>>>> |                 | Lost         | 1615 Kpps | 0 pps    |
>>>>>> +-----------------+--------------+-----------+----------+
>>>>>> | TAP+vhost_net   | Transmitted  | 589 Kpps  | 588 Kpps |
>>>>>> |                 +--------------+-----------+----------+
>>>>>> |                 | Lost         | 1164 Kpps | 0 pps    |
>>>>>> +-----------------+--------------+-----------+----------+
>>>>>
>>>>
>>>> Hi Jason,
>>>>
>>>> thank you for your reply!
>>>>
>>>>> PPS drops somehow for TAP, any reason for that?
>>>>
>>>> I have no explicit explanation for that except general overheads coming
>>>> with this implementation.
>>>
>>> It would be better to fix that.
>>>
>>>>
>>>>>
>>>>> Btw, I had some questions:
>>>>>
>>>>> 1) most of the patches in this series would introduce non-trivial
>>>>> impact on the performance, we probably need to benchmark each or split
>>>>> the series. What's more we need to run TCP benchmark
>>>>> (throughput/latency) as well as pktgen see the real impact
>>>>
>>>> What could be done, IMO, is to activate tun_ring_consume() /
>>>> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
>>>> if this alone drops performance.
>>>>
>>>> For TCP benchmarks, you mean userspace performance like iperf3 between a
>>>> host and a guest system?
>>>
>>> Yes,
>>>
>>>>
>>>>>
>>>>> 2) I see this:
>>>>>
>>>>>         if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
>>>>>                 drop_reason = SKB_DROP_REASON_FULL_RING;
>>>>>                 goto drop;
>>>>>         }
>>>>>
>>>>> So there could still be packet drop? Or is this related to the XDP path?
>>>>
>>>> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
>>>> unconsume. Since those two happen so rarely, I figured we should just
>>>> drop in this case.
>>>>
>>>>>
>>>>> 3) The LLTX change would have performance implications, but the
>>>>> benmark doesn't cover the case where multiple transmission is done in
>>>>> parallel
>>>>
>>>> Do you mean multiple applications that produce traffic and potentially
>>>> run on different CPUs?
>>>
>>> Yes.
>>>
>>>>
>>>>>
>>>>> 4) After the LLTX change, it seems we've lost the synchronization with
>>>>> the XDP_TX and XDP_REDIRECT path?
>>>>
>>>> I must admit I did not take a look at XDP and cannot really judge if/how
>>>> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
>>>> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
>>>> call and I do not see the impact for XDP, which calls its own methods.
>>>
>>> Without LLTX tun_net_xmit is protected by tx lock but it is not the
>>> case of tun_xdp_xmit. This is because, unlike other devices, tun
>>> doesn't have a dedicated TX queue for XDP, so the queue is shared by
>>> both XDP and skb. So XDP xmit path needs to be protected with tx lock
>>> as well, and since we don't have queue discipline for XDP, it means we
>>> could still drop packets when XDP is enabled. I'm not sure this would
>>> defeat the whole idea or not.
>>
>> Good point.
>>
>>>
>>>>>
>>>>> 5) The series introduces various ptr_ring helpers with lots of
>>>>> ordering stuff which is complicated, I wonder if we first have a
>>>>> simple patch to implement the zero packet loss
>>>>
>>>> I personally don't see how a simpler patch is possible without using
>>>> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
>>>> spin locking between producer and consumer. But I am open for
>>>> suggestions :)
>>>
>>> I see NETDEV_TX_BUSY is used by veth:
>>>
>>> static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
>>> {
>>>         if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
>>>                 return NETDEV_TX_BUSY; /* signal qdisc layer */
>>>
>>>         return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>>> }
>>>
>>> Maybe it would be simpler to start from that (probably with a new tun->flags?).
>>>
>>> Thanks
>>
>> Do you mean that this patchset could be implemented using the same
>> approach that was used for veth in [1]?
>> This could then also fix the XDP path.
> 
> I think so.

Okay, I will do so and submit a v7 when net-next opens again for 6.19.

> 
>>
>> But is returning NETDEV_TX_BUSY fine in our case?
> 
> If it helps to avoid packet drop. But I'm not sure if qdisc is a must
> in your case.

I will try to avoid returning it.

When no qdisc is connected, I will just drop like veth does.

> 
>>
>> Do you mean a flag that enables or disables the no-drop behavior?
> 
> Yes, via a new flags that could be set via TUNSETIFF.
> 
> Thanks

I am not a fan of that, since I can not imagine a use case where
dropping packets is desired. veth does not introduce a flag either.

Of course, if there is a major performance degradation, it makes sense.
But I will benchmark it, and we will see.

Thank you!

> 
>>
>> Thanks!
>>
>> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#u
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> This patch series includes tun/tap, and vhost-net because they share
>>>>>> logic. Adjusting only one of them would break the others. Therefore, the
>>>>>> patch series is structured as follows:
>>>>>> 1+2: new ptr_ring helpers for 3
>>>>>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
>>>>>> management
>>>>>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
>>>>>> vhost-net
>>>>>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
>>>>>> not break git bisect
>>>>>> 8: tun/tap: drop get ring exports (not used anymore)
>>>>>>
>>>>>> Possible future work:
>>>>>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
>>>>>
>>>>> This seems to be not easy. The tx completion depends on the userspace behaviour.
>>>>
>>>> I agree, but I really would like to reduce the buffer bloat caused by the
>>>> default 500 TUN / 1000 TAP packet queue without losing performance.
>>>>
>>>>>
>>>>>> - Adaption of the netdev queue flow control for ipvtap & macvtap
>>>>>>
>>>>>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
>>>>>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
>>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>> Thanks! :)
>>>>
>>>
>>
>