Message-ID: <92470838-2B98-4FC6-8E5B-A8AF14965D4C@nutanix.com>
Date: Thu, 1 May 2025 01:21:30 +0000
From: Jon Kohler <jon@...anix.com>
To: Eugenio Perez Martin <eperezma@...hat.com>
CC: Jason Wang <jasowang@...hat.com>, "Michael S. Tsirkin" <mst@...hat.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"virtualization@...ts.linux.dev"
<virtualization@...ts.linux.dev>,
"netdev@...r.kernel.org"
<netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>,
Stefano Brivio <sbrivio@...hat.com>
Subject: Re: [PATCH] vhost/net: remove zerocopy support
> On Apr 16, 2025, at 6:15 AM, Eugenio Perez Martin <eperezma@...hat.com> wrote:
>
> On Tue, Apr 8, 2025 at 8:28 AM Jason Wang <jasowang@...hat.com> wrote:
>>
>> On Tue, Apr 8, 2025 at 9:18 AM Jon Kohler <jon@...anix.com> wrote:
>>>
>>>
>>>
>>>> On Apr 6, 2025, at 7:14 PM, Jason Wang <jasowang@...hat.com> wrote:
>>>>
>>>> On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler <jon@...anix.com> wrote:
>>>>>
>>>>> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled
>>>>> the module parameter for the handle_tx_zerocopy path back in 2019,
>>>>> noting that many downstream distributions (e.g., RHEL7 and later) had
>>>>> already done the same.
>>>>>
>>>>> Both upstream and downstream disablement suggest this path is rarely
>>>>> used.
>>>>>
>>>>> Testing the module parameter shows that while the path allows packet
>>>>> forwarding, the zerocopy functionality itself is broken. On outbound
>>>>> traffic (guest TX -> external), zerocopy SKBs are orphaned by either
>>>>> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit())
>>>>
>>>> This is by design to avoid DOS.
>>>
>>> I understand that, but it makes ZC non-functional in general, as ZC fails
>>> and immediately increments the error counters.
>>
>> The main issue is HOL, but zerocopy may still work in some setups that
>> don't need to care about HOL. One example is the macvtap passthrough
>> mode.
>>
>>>
>>>>
>>>>> or
>>>>> skb_orphan_frags() elsewhere in the stack,
>>>>
>>>> Basically zerocopy is expected to work for guest -> remote case, so
>>>> could we still hit skb_orphan_frags() in this case?
>>>
>>> Yes, you’d hit that in tun_net_xmit().
>>
>> Only for local VM to local VM communication.
Sure, but the tricky bit here is that with a mix of VM-to-VM and VM-to-external
traffic patterns, the zc error counter goes up any time the error path is hit.
Once that happens, ZC gets silently disabled anyhow, which leads to sporadic
success and non-deterministic performance.
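For reference, the silent disable I mean is this heuristic (roughly, paraphrasing
drivers/vhost/net.c from memory, not a verbatim quote): once the error count catches
up to the packet count, handle_tx_zerocopy() simply stops selecting the zerocopy path:

	/* Paraphrased sketch: zerocopy is only attempted while errors stay at
	 * or below roughly 1/64th of the packets sent, so a steady trickle of
	 * orphaned SKBs pushes every subsequent TX back onto the copy path.
	 */
	static bool vhost_net_tx_select_zcopy(struct vhost_net *net)
	{
		return !net->tx_flush &&
		       net->tx_packets / 64 >= net->tx_zcopy_err;
	}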
>>
>>> If you punch a hole in that *and* in the
>>> zc error counter (such that failed ZC doesn’t disable ZC in vhost), you get ZC
>>> from vhost; however, the skb then goes through the network interrupt handler under
>>> net_tx_action and eventually incurs the memcpy under dev_queue_xmit_nit().
>>
>> Well, yes, we need a copy if there's a packet socket. But if there are
>> no taps on the network interface, we don't need to do the copy here.
>>
Agreed on the packet socket side. I recently fixed an issue in lldpd [1] that was
preventing exactly this case; however, there are still other trip wires spread across
the stack that would need to be addressed.
[1] https://github.com/lldpd/lldpd/commit/622a91144de4ae487ceebdb333863e9f660e0717
>
> Hi!
>
> I need more time diving into the issues. As Jon mentioned, vhost ZC is
> so little used I didn't have the chance to experiment with this until
> now :). But yes, I expect to be able to overcome these for pasta, by
> adapting buffer sizes or modifying code etc.
Another tricky bit here is that zerocopy has been disabled both upstream and downstream
for so long that the code has naturally collected a few wrenches in the engine.
RE buffer sizes: I tried this as well, because I think on sufficiently fast systems,
zero copy gets especially interesting in the GSO/TSO cases where you have mega
payloads.
I tried playing around with the goodcopy value such that ZC restricted itself to,
let's say, 32K payloads and above, and while it *does* work (with enough holes
punched in), absolute throughput doesn't actually go up; it's just that CPU utilization
goes down a pinch. Not a bad thing, certainly, but still not great.
In fact, I found that throughput actually went down with this path, even with ZC
completing successfully, as there was still a mix of ZC and non-ZC because you can
only have so many messages pending at any given time before the copy path kicks in again.
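To make that concrete, the experiment was roughly the following (a sketch against
drivers/vhost/net.c from memory, not a proposed patch; the 32K value is just the
threshold I was testing with):

	/* Sketch only: raise the "big enough to bother zerocopying" threshold
	 * from 256 bytes to 32K so that only GSO/TSO-sized payloads take the
	 * zerocopy path; everything smaller falls back to the copy path.
	 */
	-#define VHOST_GOODCOPY_LEN 256
	+#define VHOST_GOODCOPY_LEN (32 * 1024)

	/* handle_tx_zerocopy() then keeps its existing gate (as I recall it): */
	zcopy_used = len >= VHOST_GOODCOPY_LEN &&
		     !vhost_exceeds_maxpend(net) &&
		     vhost_net_tx_select_zcopy(net);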
>
>>>
>>> This is no more performant, and in fact is actually worse since the time spent
>>> waiting on that memcpy to resolve is longer.
>>>
>>>>
>>>>> as vhost_net does not set
>>>>> SKBFL_DONT_ORPHAN.
>>
>> Maybe we can try to set this, as vhost-net can honor ulimit now.
Yeah, I tried that, and while it helps kick things further down the stack, it's not
actually faster in any testing I've drummed up.
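For completeness, what I tried was roughly the following hypothetical one-liner in
tun_get_user() (a sketch, not a submitted patch), since the skb is actually built on
the tun side rather than in vhost itself:

	/* Hypothetical sketch: tag tun/vhost zerocopy skbs so that the plain
	 * skb_orphan_frags() checks elsewhere in the stack leave the frags
	 * alone. Note it does nothing for skb_orphan_frags_rx(), which
	 * deliberately ignores the flag.
	 */
	if (zerocopy) {
		skb_zcopy_init(skb, msg_control);
		skb_shinfo(skb)->flags |= SKBFL_DONT_ORPHAN;
	}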
>>
>>>>>
>>>>> Orphaning enforces a memcpy and triggers the completion callback, which
>>>>> increments the failed TX counter, effectively disabling zerocopy again.
>>>>>
>>>>> Even after addressing these issues to prevent SKB orphaning and error
>>>>> counter increments, performance remains poor. By default, only 64
>>>>> messages can be zerocopied, which is immediately exhausted by workloads
>>>>> like iperf, resulting in most messages being memcpy'd anyhow.
>>>>>
>>>>> Additionally, memcpy'd messages do not benefit from the XDP batching
>>>>> optimizations present in the handle_tx_copy path.
>>>>>
>>>>> Given these limitations and the lack of any tangible benefits, remove
>>>>> zerocopy entirely to simplify the code base.
>>>>>
>>>>> Signed-off-by: Jon Kohler <jon@...anix.com>
>>>>
>>>> Any chance we can fix those issues? Actually, we had a plan to make
>>>> use of vhost-net and its tx zerocopy (or even implement the rx
>>>> zerocopy) in pasta.
>>>
>>> Happy to take direction and ideas here, but I don’t see a clear way to fix these
>>> issues, without dealing with the assertions that skb_orphan_frags_rx calls out.
>>>
>>> Said another way, I’d be interested in hearing if there is a config where ZC in
>>> the current vhost-net implementation works, as I was driving myself crazy trying to
>>> reverse engineer it.
>>
>> See above.
>>
>>>
>>> Happy to collaborate if there is something we could do here.
>>
>> Great, we can start here by seeking a way to fix the known issues of
>> the vhost-net zerocopy code.
>>
>
> Happy to help here :).
>
> Jon, could you share more details about the orphan problem so I can
> speed up the help? For example, can you describe the code changes and
> the code path that would lead to that assertion of
> skb_orphan_frags_rx?
>
> Thanks!
>
Sorry for the slow response, getting back from holiday and catching up.
When running through tun.c, there are a handful of places where ZC turns into
a full copy, whether that is in the tun code itself, or in the interrupt handler when
tun xmit is running.
For example, tun_net_xmit() unconditionally calls skb_orphan_frags_rx(). Anything
with frags gets this memcpy, and those are of course the "juicy" targets here since
they take up the most memory bandwidth in general. Nasty catch-22 :)
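For reference, this is the call I mean (roughly, paraphrasing tun_net_xmit() in
drivers/net/tun.c): any skb still carrying userspace frags gets those frags copied
right at xmit time, or dropped if the copy fails:

	/* Paraphrased from tun_net_xmit(): the skb may loop back into an RX
	 * path (another tap/VM, packet sockets, etc.), so userspace (zerocopy)
	 * frags are copied out unconditionally before transmit.
	 */
	if (skb_orphan_frags_rx(skb, GFP_ATOMIC))
		goto drop;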
There are also plenty of places that call plain skb_orphan_frags(), which triggers
a copy because vhost doesn't set SKBFL_DONT_ORPHAN. That's an easy fix, but still
something to think about.
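To make the distinction concrete, the two helpers look roughly like this (paraphrasing
include/linux/skbuff.h, not a verbatim quote): SKBFL_DONT_ORPHAN gets us past the first
one, but the _rx variant that tun_net_xmit() uses copies regardless:

	static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
	{
		if (likely(!skb_zcopy(skb)))
			return 0;
		if (skb_shinfo(skb)->flags & SKBFL_DONT_ORPHAN)
			return 0;	/* the flag lets the frags survive here */
		return skb_copy_ubufs(skb, gfp_mask);
	}

	/* The RX-path variant copies even with the flag set (DoS protection). */
	static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
	{
		if (likely(!skb_zcopy(skb)))
			return 0;
		return skb_copy_ubufs(skb, gfp_mask);
	}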
Then there is the issue of packet sockets, which throw a king-sized wrench into
this. It's slightly insidious: it isn't directly apparent that loading some user-space
app nukes zero copy, but it happens.
See my previous comment about lldpd, where a simple compiler snafu caused
one socket option to silently break, which then ripped out ZC capability. Easy
fix, but it's an example of how this can fall over.
Bottom line, I'd *love* to have ZC work, work well, and so on. I'm open to ideas
here :) (up to and including both A) fixing it and B) deleting it).