[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a9ee9ef5-e707-65ff-3128-41d09fbe8655@sberdevices.ru>
Date: Wed, 3 May 2023 16:11:59 +0300
From: Arseniy Krasnov <avkrasnov@...rdevices.ru>
To: Stefano Garzarella <sgarzare@...hat.com>
CC: Stefan Hajnoczi <stefanha@...hat.com>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
"Michael S. Tsirkin" <mst@...hat.com>,
Jason Wang <jasowang@...hat.com>,
Bobby Eshleman <bobby.eshleman@...edance.com>,
<kvm@...r.kernel.org>, <virtualization@...ts.linux-foundation.org>,
<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<kernel@...rdevices.ru>, <oxffffaa@...il.com>
Subject: Re: [RFC PATCH v2 00/15] vsock: MSG_ZEROCOPY flag support
On 03.05.2023 15:52, Stefano Garzarella wrote:
> Hi Arseniy,
> Sorry for the delay, but I have been very busy.
Hello, no problem!
>
> I can't apply this series on master or net-next, can you share with me
> the base commit?
Here is my base:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=b103bab0944be030954e5de23851b37980218f54
>
> On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:
>> Hello,
>>
>> DESCRIPTION
>>
>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>> current implementation for TCP as much as possible:
>>
>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>> flag will be ignored (e.g. without completion).
>>
>> 2) Kernel uses completions from socket's error queue. Single completion
>> for single tx syscall (or it can merge several completions to single
>> one). I used already implemented logic for MSG_ZEROCOPY support:
>> 'msg_zerocopy_realloc()' etc.
>>
>> Difference with copy way is not significant. During packet allocation,
>> non-linear skb is created, then I call 'pin_user_pages()' for each page
>> from user's iov iterator and add each returned page to the skb as fragment.
>> There are also some updates for vhost and guest parts of transport - in
>> both cases i've added handling of non-linear skb for virtio part. vhost
>> copies data from such skb to the guest's rx virtio buffers. In the guest,
>> virtio transport fills tx virtio queue with pages from skb.
>>
>> This version has several limits/problems:
>>
>> 1) As this feature totally depends on transport, there is no way (or it
>> is difficult) to check whether transport is able to handle it or not
>> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>> are not considered to be called from each other. So in current version
>> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>> tx routine will fail with EOPNOTSUPP.
>
> Do you plan to fix this in the next versions?
>
> If it is too complicated, I think we can have this limitation until we
> find a good solution.
>
I'll try to fix it again, but just didn't pay attention on it in v2.
>>
>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>> one completion. In each completion there is flag which shows how tx
>> was performed: zerocopy or copy. This leads that whole message must
>> be send in zerocopy or copy way - we can't send part of message with
>> copying and rest of message with zerocopy mode (or vice versa). Now,
>> we need to account vsock credit logic, e.g. we can't send whole data
>> once - only allowed number of bytes could sent at any moment. In case
>> of copying way there is no problem as in worst case we can send single
>> bytes, but zerocopy is more complex because smallest transmission
>> unit is single page. So if there is not enough space at peer's side
>> to send integer number of pages (at least one) - we will wait, thus
>> stalling tx side. To overcome this problem i've added simple rule -
>> zerocopy is possible only when there is enough space at another side
>> for whole message (to check, that current 'msghdr' was already used
>> in previous tx iterations i use 'iov_offset' field of it's iov iter).
>
> So, IIUC if MSG_ZEROCOPY is set, but there isn't enough space in the
> destination we temporarily disable zerocopy, also if MSG_ZEROCOPY is set.
> Right?
Exactly, user still needs to get completion (because SO_ZEROCOPY is enabled and
MSG_ZEROCOPY flag as used). But completion structure contains information that
there was copying instead of zerocopying.
>
> If it is the case it seems reasonable to me.
>
>>
>> 3) loopback transport is not supported, because it requires to implement
>> non-linear skb handling in dequeue logic (as we "send" fragged skb
>> and "receive" it from the same queue). I'm going to implement it in
>> next versions.
>>
>> ^^^ fixed in v2
>>
>> 4) Current implementation sets max length of packet to 64KB. IIUC this
>> is due to 'kmalloc()' allocated data buffers. I think, in case of
>> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>> not touched for data - user space pages are used as buffers. Also
>> this limit trims every message which is > 64KB, thus such messages
>> will be send in copy mode due to 'iov_offset' check in 2).
>>
>> ^^^ fixed in v2
>>
>> PATCHSET STRUCTURE
>>
>> Patchset has the following structure:
>> 1) Handle non-linear skbuff on receive in virtio/vhost.
>> 2) Handle non-linear skbuff on send in virtio/vhost.
>> 3) Updates for AF_VSOCK.
>> 4) Enable MSG_ZEROCOPY support on transports.
>> 5) Tests/tools/docs updates.
>>
>> PERFORMANCE
>>
>> Performance: it is a little bit tricky to compare performance between
>> copy and zerocopy transmissions. In zerocopy way we need to wait when
>> user buffers will be released by kernel, so it something like synchronous
>> path (wait until device driver will process it), while in copy way we
>> can feed data to kernel as many as we want, don't care about device
>> driver. So I compared only time which we spend in the 'send()' syscall.
>> Then if this value will be combined with total number of transmitted
>> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
>> enough credit, receiver allocates same amount of space as sender needs.
>>
>> Sender:
>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>>
>> Receiver:
>> ./vsock_perf --vsk-size 256M
>>
>> G2H transmission (values are Gbit/s):
>>
>> *-------------------------------*
>> | | | |
>> | buf size | copy | zerocopy |
>> | | | |
>> *-------------------------------*
>> | 4KB | 3 | 10 |
>> *-------------------------------*
>> | 32KB | 9 | 45 |
>> *-------------------------------*
>> | 256KB | 24 | 195 |
>> *-------------------------------*
>> | 1M | 27 | 270 |
>> *-------------------------------*
>> | 8M | 22 | 277 |
>> *-------------------------------*
>>
>> H2G:
>>
>> *-------------------------------*
>> | | | |
>> | buf size | copy | zerocopy |
>> | | | |
>> *-------------------------------*
>> | 4KB | 17 | 11 |
>
> Do you know why in this case zerocopy is slower in this case?
> Could be the cost of pin/unpin pages?
May be, i think i need to analyze such enormous difference more. Also about
pin/unpin: i found that there is already implemented function to fill non-linear
skb with pages from user's iov: __zerocopy_sg_from_iter() in net/core/datagram.c.
It uses 'get_user_pages()' instead of 'pin_user_pages()'. May be in my case it
is also valid to user 'get_XXX()' instead of 'pin_XXX()', because it is used by
TCP MSG_ZEROCOPY and iouring MSG_ZEROCOPY.
>
>> *-------------------------------*
>> | 32KB | 30 | 66 |
>> *-------------------------------*
>> | 256KB | 38 | 179 |
>> *-------------------------------*
>> | 1M | 38 | 234 |
>> *-------------------------------*
>> | 8M | 28 | 279 |
>> *-------------------------------*
>>
>> Loopback:
>>
>> *-------------------------------*
>> | | | |
>> | buf size | copy | zerocopy |
>> | | | |
>> *-------------------------------*
>> | 4KB | 8 | 7 |
>> *-------------------------------*
>> | 32KB | 34 | 42 |
>> *-------------------------------*
>> | 256KB | 43 | 83 |
>> *-------------------------------*
>> | 1M | 40 | 109 |
>> *-------------------------------*
>> | 8M | 40 | 171 |
>> *-------------------------------*
>>
>> I suppose that huge difference above between both modes has two reasons:
>> 1) We don't need to copy data.
>> 2) We don't need to allocate buffer for data, only for header.
>>
>> Zerocopy is faster than classic copy mode, but of course it requires
>> specific architecture of application due to user pages pinning, buffer
>> size and alignment.
>>
>> If host fails to send data with "Cannot allocate memory", check value
>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
>> allocation.
>
> What the user needs to do? Increase it?
>
Yes, i'll update it.
>>
>> TESTING
>>
>> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>> cover new code as much as possible so there are different cases for
>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>> vector types (different sizes, alignments, with unmapped pages). I also
>> run tests with loopback transport and running vsockmon.
>
> Thanks for the test again :-)
>
> This cover letter is very good, with a lot of details, but please add
> more details in each single patch, explaining the reason of the changes,
> otherwise it is very difficult to review, because it is a very big
> change.
>
> I'll do a per-patch review in the next days.
Sure, thanks! In v3 i'm also working on io_uring test, because this thing also
supports MSG_ZEROCOPY, so we can do virtio/vsock + MSG_ZEROCOPY + io_uring.
Thanks, Arseniy
>
> Thanks,
> Stefano
>
Powered by blists - more mailing lists