Message-ID: <CAJaqyWcNNFRnFmmkEHhOPGWAL05P1EO1ebMJY8+YUC0jxyq3hg@mail.gmail.com>
Date: Thu, 15 May 2025 09:52:12 +0200
From: Eugenio Perez Martin <eperezma@...hat.com>
To: Jon Kohler <jon@...anix.com>
Cc: Jason Wang <jasowang@...hat.com>, "Michael S. Tsirkin" <mst@...hat.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"virtualization@...ts.linux.dev" <virtualization@...ts.linux.dev>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Stefano Brivio <sbrivio@...hat.com>
Subject: Re: [PATCH] vhost/net: remove zerocopy support
On Mon, May 12, 2025 at 5:21 PM Jon Kohler <jon@...anix.com> wrote:
>
>
>
> > On Apr 30, 2025, at 9:21 PM, Jon Kohler <jon@...anix.com> wrote:
> >
> >
> >
> >> On Apr 16, 2025, at 6:15 AM, Eugenio Perez Martin <eperezma@...hat.com> wrote:
> >>
> >>
> >> On Tue, Apr 8, 2025 at 8:28 AM Jason Wang <jasowang@...hat.com> wrote:
> >>>
> >>> On Tue, Apr 8, 2025 at 9:18 AM Jon Kohler <jon@...anix.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Apr 6, 2025, at 7:14 PM, Jason Wang <jasowang@...hat.com> wrote:
> >>>>>
> >>>>>
> >>>>> On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler <jon@...anix.com> wrote:
> >>>>>>
> >>>>>> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled
> >>>>>> the module parameter for the handle_tx_zerocopy path back in 2019,
> >>>>>> noting that many downstream distributions (e.g., RHEL7 and later) had
> >>>>>> already done the same.
> >>>>>>
> >>>>>> Both upstream and downstream disablement suggest this path is rarely
> >>>>>> used.
> >>>>>>
> >>>>>> Testing the module parameter shows that while the path allows packet
> >>>>>> forwarding, the zerocopy functionality itself is broken. On outbound
> >>>>>> traffic (guest TX -> external), zerocopy SKBs are orphaned by either
> >>>>>> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit())
> >>>>>
> >>>>> This is by design to avoid DOS.
> >>>>
> >>>> I understand that, but it makes ZC non-functional in general, as ZC fails
> >>>> and immediately increments the error counters.
> >>>
> >>> The main issue is HOL, but zerocopy may still work in some setups that
> >>> don't need to care about HOL. One example is the macvtap passthrough
> >>> mode.
> >>>
> >>>>
> >>>>>
> >>>>>> or
> >>>>>> skb_orphan_frags() elsewhere in the stack,
> >>>>>
> >>>>> Basically zerocopy is expected to work for guest -> remote case, so
> >>>>> could we still hit skb_orphan_frags() in this case?
> >>>>
> >>>> Yes, you’d hit that in tun_net_xmit().
> >>>
> >>> Only for local VM to local VM communication.
> >
> > Sure, but the tricky bit here is that if you have a mix of VM-VM and VM-external
> > traffic patterns, any time the error path is hit, the zc error counter will go up.
> >
> > When that happens, ZC will get silently disabled anyhow, so it leads to sporadic
> > success / non-deterministic performance.
> >
> >>>
> >>>> If you punch a hole in that *and* in the
> >>>> zc error counter (such that failed ZC doesn’t disable ZC in vhost), you get ZC
> >>>> from vhost; however, the network interrupt handler under net_tx_action
> >>>> eventually incurs the memcpy under dev_queue_xmit_nit().
> >>>
> >>> Well, yes, we need a copy if there's a packet socket. But if there's
> >>> no network interface taps, we don't need to do the copy here.
> >>>
> >
> > Agreed on the packet socket side. I recently fixed an issue in lldpd [1] that prevented
> > this specific case; however, there are still other trip wires spread out across the
> > stack that would need to be addressed.
> >
> > [1] https://github.com/lldpd/lldpd/commit/622a91144de4ae487ceebdb333863e9f660e0717
> >
> >>
> >> Hi!
> >>
> >> I need more time diving into the issues. As Jon mentioned, vhost ZC is
> >> so little used I didn't have the chance to experiment with this until
> >> now :). But yes, I expect to be able to overcome these for pasta, by
> >> adapting buffer sizes or modifying code etc.
> >
> > Another tricky bit here is that it has been disabled both upstream and downstream
> > for so long, the code naturally has a bit of wrench-in-the-engine.
> >
> > RE Buffer sizes: I tried this as well, because I think on sufficiently fast systems,
> > zero copy gets especially interesting in GSO/TSO cases where you have mega
> > payloads.
> >
> > I tried playing around with the good copy value such that ZC restricted itself to
> > only, let’s say, 32K payloads and above, and while it *does* work (with enough
> > holes punched in), absolute t-put doesn’t actually go up; it’s just that CPU utilization
> > goes down a pinch. Not a bad thing for certain, but still not great.
> >
> > In fact, I found that tput actually went down with this path, even with ZC occurring
> > successfully, as there was still a mix of ZC and non-ZC because you can only
> > have so many pending at any given time before the copy path kicks in again.
> >
> >
> >>
> >>>>
> >>>> This is no more performant, and in fact is actually worse since the time spent
> >>>> waiting on that memcpy to resolve is longer.
> >>>>
> >>>>>
> >>>>>> as vhost_net does not set
> >>>>>> SKBFL_DONT_ORPHAN.
> >>>
> >>> Maybe we can try to set this as vhost-net can honor ulimit now.
> >
> > Yea, I tried that, and while it helps kick things further down the stack, it’s not actually
> > faster in any testing I’ve drummed up.
> >
> >>>
> >>>>>>
> >>>>>> Orphaning enforces a memcpy and triggers the completion callback, which
> >>>>>> increments the failed TX counter, effectively disabling zerocopy again.
> >>>>>>
> >>>>>> Even after addressing these issues to prevent SKB orphaning and error
> >>>>>> counter increments, performance remains poor. By default, only 64
> >>>>>> messages can be zerocopied, which is immediately exhausted by workloads
> >>>>>> like iperf, resulting in most messages being memcpy'd anyhow.
> >>>>>>
> >>>>>> Additionally, memcpy'd messages do not benefit from the XDP batching
> >>>>>> optimizations present in the handle_tx_copy path.
> >>>>>>
> >>>>>> Given these limitations and the lack of any tangible benefits, remove
> >>>>>> zerocopy entirely to simplify the code base.
> >>>>>>
> >>>>>> Signed-off-by: Jon Kohler <jon@...anix.com>
> >>>>>
> >>>>> Any chance we can fix those issues? Actually, we had a plan to make
> >>>>> use of vhost-net and its tx zerocopy (or even implement the rx
> >>>>> zerocopy) in pasta.
> >>>>
> >>>> Happy to take direction and ideas here, but I don’t see a clear way to fix these
> >>>> issues, without dealing with the assertions that skb_orphan_frags_rx calls out.
> >>>>
> >>>> Said another way, I’d be interested in hearing if there is a config where ZC in
> >>>> the current vhost-net implementation works, as I was driving myself crazy trying to
> >>>> reverse engineer it.
> >>>
> >>> See above.
> >>>
> >>>>
> >>>> Happy to collaborate if there is something we could do here.
> >>>
> >>> Great, we can start here by seeking a way to fix the known issues of
> >>> the vhost-net zerocopy code.
> >>>
> >>
> >> Happy to help here :).
> >>
> >> Jon, could you share more details about the orphan problem so I can
> >> speed up the help? For example, can you describe the code changes and
> >> the code path that would lead to that assertion of
> >> skb_orphan_frags_rx?
> >>
> >> Thanks!
> >>
> >
> > Sorry for the slow response, getting back from holiday and catching up.
> >
> > When running through tun.c, there are a handful of places where ZC turns into
> > a full copy, whether that is in the tun code itself, or in the interrupt handler when
> > tun xmit is running.
> >
> > For example, tun_net_xmit mandatorily calls skb_orphan_frags_rx. Anything
> > with frags will get this memcpy, and those are of course the “juicy” targets here, as
> > they would take up the most memory bandwidth in general. Nasty catch-22 :)
> >
> > There are also plenty of places that call normal skb_orphan_frags, which
> > triggers because vhost doesn’t set SKBFL_DONT_ORPHAN. That’s an easy
> > fix, but still something to think about.
> >
> > Then there is the issue of packet sockets, which throw a king sized wrench into
> > this. It’s slightly insidious: it isn’t directly apparent that loading some user-
> > space app nukes zero copy, but it happens.
> >
> > See my previous comment about lldpd, where a simple compiler snafu caused
> > one socket option to silently break, which then ripped out ZC capability. Easy
> > fix, but it’s an example of how this can fall over.
> >
> > Bottom line, I’d *love* to have ZC work, work well, and so on. I’m open to ideas
> > here :) (up to and including both A) fixing it and B) deleting it)
>
> Hey Eugenio - wondering if you had a chance to check out my notes on this?
>
Sorry, I thought I was going to have the code ready by now :). I'll
need more time to go through the items.