[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CACGkMEt_Z0a3kidmmjXcajU2EVi-B6mi832weeumPfZzmLoEPA@mail.gmail.com>
Date: Tue, 25 Nov 2025 09:34:48 +0800
From: Jason Wang <jasowang@...hat.com>
To: Simon Schippers <simon.schippers@...dortmund.de>
Cc: willemdebruijn.kernel@...il.com, andrew+netdev@...n.ch,
davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com,
mst@...hat.com, eperezma@...hat.com, jon@...anix.com,
tim.gebauer@...dortmund.de, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
virtualization@...ts.linux.dev
Subject: Re: [PATCH net-next v6 0/8] tun/tap & vhost-net: netdev queue flow
control to avoid ptr_ring tail drop
On Mon, Nov 24, 2025 at 5:20 PM Simon Schippers
<simon.schippers@...dortmund.de> wrote:
>
> On 11/24/25 02:04, Jason Wang wrote:
> > On Fri, Nov 21, 2025 at 5:23 PM Simon Schippers
> > <simon.schippers@...dortmund.de> wrote:
> >>
> >> On 11/21/25 07:19, Jason Wang wrote:
> >>> On Thu, Nov 20, 2025 at 11:30 PM Simon Schippers
> >>> <simon.schippers@...dortmund.de> wrote:
> >>>>
> >>>> This patch series deals with tun/tap and vhost-net which drop incoming
> >>>> SKBs whenever their internal ptr_ring buffer is full. Instead, with this
> >>>> patch series, the associated netdev queue is stopped before this happens.
> >>>> This allows the connected qdisc to function correctly as reported by [1]
> >>>> and improves application-layer performance, see our paper [2]. Meanwhile
> >>>> the theoretical performance differs only slightly:
> >>>>
> >>>> +--------------------------------+-----------+----------+
> >>>> | pktgen benchmarks to Debian VM | Stock | Patched |
> >>>> | i5 6300HQ, 20M packets | | |
> >>>> +-----------------+--------------+-----------+----------+
> >>>> | TAP | Transmitted | 195 Kpps | 183 Kpps |
> >>>> | +--------------+-----------+----------+
> >>>> | | Lost | 1615 Kpps | 0 pps |
> >>>> +-----------------+--------------+-----------+----------+
> >>>> | TAP+vhost_net | Transmitted | 589 Kpps | 588 Kpps |
> >>>> | +--------------+-----------+----------+
> >>>> | | Lost | 1164 Kpps | 0 pps |
> >>>> +-----------------+--------------+-----------+----------+
> >>>
> >>
> >> Hi Jason,
> >>
> >> thank you for your reply!
> >>
> >>> PPS drops somehow for TAP, any reason for that?
> >>
> >> I have no explicit explanation for that except general overheads coming
> >> with this implementation.
> >
> > It would be better to fix that.
> >
> >>
> >>>
> >>> Btw, I had some questions:
> >>>
> >>> 1) most of the patches in this series would introduce non-trivial
> >>> impact on the performance, we probably need to benchmark each or split
> >>> the series. What's more we need to run TCP benchmark
> >>> (throughput/latency) as well as pktgen see the real impact
> >>
> >> What could be done, IMO, is to activate tun_ring_consume() /
> >> tap_ring_consume() before enabling tun_ring_produce(). Then we could see
> >> if this alone drops performance.
> >>
> >> For TCP benchmarks, you mean userspace performance like iperf3 between a
> >> host and a guest system?
> >
> > Yes,
> >
> >>
> >>>
> >>> 2) I see this:
> >>>
> >>> if (unlikely(tun_ring_produce(&tfile->tx_ring, queue, skb))) {
> >>> drop_reason = SKB_DROP_REASON_FULL_RING;
> >>> goto drop;
> >>> }
> >>>
> >>> So there could still be packet drop? Or is this related to the XDP path?
> >>
> >> Yes, there can be packet drops after a ptr_ring resize or a ptr_ring
> >> unconsume. Since those two happen so rarely, I figured we should just
> >> drop in this case.
> >>
> >>>
> >>> 3) The LLTX change would have performance implications, but the
> >>> benmark doesn't cover the case where multiple transmission is done in
> >>> parallel
> >>
> >> Do you mean multiple applications that produce traffic and potentially
> >> run on different CPUs?
> >
> > Yes.
> >
> >>
> >>>
> >>> 4) After the LLTX change, it seems we've lost the synchronization with
> >>> the XDP_TX and XDP_REDIRECT path?
> >>
> >> I must admit I did not take a look at XDP and cannot really judge if/how
> >> lltx has an impact on XDP. But from my point of view, __netif_tx_lock()
> >> instead of __netif_tx_acquire(), is executed before the tun_net_xmit()
> >> call and I do not see the impact for XDP, which calls its own methods.
> >
> > Without LLTX tun_net_xmit is protected by tx lock but it is not the
> > case of tun_xdp_xmit. This is because, unlike other devices, tun
> > doesn't have a dedicated TX queue for XDP, so the queue is shared by
> > both XDP and skb. So XDP xmit path needs to be protected with tx lock
> > as well, and since we don't have queue discipline for XDP, it means we
> > could still drop packets when XDP is enabled. I'm not sure this would
> > defeat the whole idea or not.
>
> Good point.
>
> >
> >>>
> >>> 5) The series introduces various ptr_ring helpers with lots of
> >>> ordering stuff which is complicated, I wonder if we first have a
> >>> simple patch to implement the zero packet loss
> >>
> >> I personally don't see how a simpler patch is possible without using
> >> discouraged practices like returning NETDEV_TX_BUSY in tun_net_xmit or
> >> spin locking between producer and consumer. But I am open for
> >> suggestions :)
> >
> > I see NETDEV_TX_BUSY is used by veth:
> >
> > static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> > {
> > if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
> > return NETDEV_TX_BUSY; /* signal qdisc layer */
> >
> > return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
> > }
> >
> > Maybe it would be simpler to start from that (probably with a new tun->flags?).
> >
> > Thanks
>
> Do you mean that this patchset could be implemented using the same
> approach that was used for veth in [1]?
> This could then also fix the XDP path.
I think so.
>
> But is returning NETDEV_TX_BUSY fine in our case?
If it helps to avoid packet drop. But I'm not sure if qdisc is a must
in your case.
>
> Do you mean a flag that enables or disables the no-drop behavior?
Yes, via a new flags that could be set via TUNSETIFF.
Thanks
>
> Thanks!
>
> [1] Link: https://lore.kernel.org/netdev/174559288731.827981.8748257839971869213.stgit@firesoul/T/#u
>
> >
> >>
> >>>
> >>>>
> >>>> This patch series includes tun/tap, and vhost-net because they share
> >>>> logic. Adjusting only one of them would break the others. Therefore, the
> >>>> patch series is structured as follows:
> >>>> 1+2: new ptr_ring helpers for 3
> >>>> 3: tun/tap: tun/tap: add synchronized ring produce/consume with queue
> >>>> management
> >>>> 4+5+6: tun/tap: ptr_ring wrappers and other helpers to be called by
> >>>> vhost-net
> >>>> 7: tun/tap & vhost-net: only now use the previous implemented functions to
> >>>> not break git bisect
> >>>> 8: tun/tap: drop get ring exports (not used anymore)
> >>>>
> >>>> Possible future work:
> >>>> - Introduction of Byte Queue Limits as suggested by Stephen Hemminger
> >>>
> >>> This seems to be not easy. The tx completion depends on the userspace behaviour.
> >>
> >> I agree, but I really would like to reduce the buffer bloat caused by the
> >> default 500 TUN / 1000 TAP packet queue without losing performance.
> >>
> >>>
> >>>> - Adaption of the netdev queue flow control for ipvtap & macvtap
> >>>>
> >>>> [1] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
> >>>> [2] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
> >>>>
> >>>
> >>> Thanks
> >>>
> >>
> >> Thanks! :)
> >>
> >
>
Powered by blists - more mailing lists