Message-ID: <B32E2C5D-25FB-427F-8567-701C152DFDE6@nutanix.com>
Date: Tue, 8 Apr 2025 01:18:09 +0000
From: Jon Kohler <jon@...anix.com>
To: Jason Wang <jasowang@...hat.com>
CC: "Michael S. Tsirkin" <mst@...hat.com>,
Eugenio Pérez <eperezma@...hat.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"virtualization@...ts.linux.dev" <virtualization@...ts.linux.dev>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] vhost/net: remove zerocopy support
> On Apr 6, 2025, at 7:14 PM, Jason Wang <jasowang@...hat.com> wrote:
>
> On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler <jon@...anix.com> wrote:
>>
>> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled
>> the module parameter for the handle_tx_zerocopy path back in 2019,
>> noting that many downstream distributions (e.g., RHEL7 and later) had
>> already done the same.
>>
>> Both upstream and downstream disablement suggest this path is rarely
>> used.
>>
>> Testing the module parameter shows that while the path allows packet
>> forwarding, the zerocopy functionality itself is broken. On outbound
>> traffic (guest TX -> external), zerocopy SKBs are orphaned by either
>> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit())
>
> This is by design to avoid DOS.
I understand that, but it makes ZC non-functional in general, as ZC fails
and immediately increments the error counters.
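To illustrate the effect, here is a rough userspace sketch of how I read the
failure accounting. It is not the kernel code: the struct, the function names,
and the constants (one tolerated failure per 64 packets, counters reset every
1024 packets) are assumptions for illustration and should be checked against
drivers/vhost/net.c.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the counters vhost-net keeps. */
struct zc_model {
	unsigned int tx_packets;   /* packets counted in the current window */
	unsigned int tx_zcopy_err; /* failed zerocopy completions in the window */
};

/* Assumed heuristic: tolerate one failure per 64 packets sent. */
static bool zc_select(const struct zc_model *m)
{
	return m->tx_packets / 64 >= m->tx_zcopy_err;
}

/* Assumed window: both counters reset once 1024 packets were counted. */
static void zc_account_packet(struct zc_model *m)
{
	if (++m->tx_packets < 1024)
		return;
	m->tx_packets = 0;
	m->tx_zcopy_err = 0;
}

/*
 * Send n packets where every zerocopy attempt gets orphaned and is
 * therefore reported as failed. Returns how many packets tried zerocopy.
 */
static unsigned int zc_attempts_when_always_orphaned(unsigned int n)
{
	struct zc_model m = { 0, 0 };
	unsigned int attempts = 0;

	for (unsigned int i = 0; i < n; i++) {
		if (zc_select(&m)) {
			attempts++;
			m.tx_zcopy_err++; /* completion callback reports failure */
		}
		zc_account_packet(&m);
	}
	return attempts;
}
```

Under those assumptions, zc_attempts_when_always_orphaned(1024) comes out to 16,
so only 16 of every 1024 packets even attempt ZC, and each of those still ends
up being copied.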
>
>> or
>> skb_orphan_frags() elsewhere in the stack,
>
> Basically zerocopy is expected to work for guest -> remote case, so
> could we still hit skb_orphan_frags() in this case?
Yes, you’d hit that in tun_net_xmit(). If you punch a hole in that *and* in the
zc error counter (such that failed ZC doesn’t disable ZC in vhost), you get ZC
from vhost; however, the packet is then handled by the network interrupt handler
under net_tx_action, which eventually incurs the memcpy under
dev_queue_xmit_nit().
This is no more performant, and is in fact worse, since the time spent waiting
on that memcpy to complete is longer.
>
>> as vhost_net does not set
>> SKBFL_DONT_ORPHAN.
>>
>> Orphaning enforces a memcpy and triggers the completion callback, which
>> increments the failed TX counter, effectively disabling zerocopy again.
>>
>> Even after addressing these issues to prevent SKB orphaning and error
>> counter increments, performance remains poor. By default, only 64
>> messages can be zerocopied, which is immediately exhausted by workloads
>> like iperf, resulting in most messages being memcpy'd anyhow.
>>
>> Additionally, memcpy'd messages do not benefit from the XDP batching
>> optimizations present in the handle_tx_copy path.
>>
>> Given these limitations and the lack of any tangible benefits, remove
>> zerocopy entirely to simplify the code base.
>>
>> Signed-off-by: Jon Kohler <jon@...anix.com>
>
> Any chance we can fix those issues? Actually, we had a plan to make
> use of vhost-net and its tx zerocopy (or even implement the rx
> zerocopy) in pasta.
Happy to take direction and ideas here, but I don’t see a clear way to fix these
issues without dealing with the assertions that skb_orphan_frags_rx() calls out.
Said another way, I’d be interested in hearing if there is a config where ZC in
the current vhost-net implementation works, as I was driving myself crazy trying
to reverse engineer one.
Happy to collaborate if there is something we could do here.
>
> Eugenio may explain more here.
>
> Thanks
>