[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220517151404.vqse5tampdsaaeji@sgarzare-redhat>
Date: Tue, 17 May 2022 17:14:04 +0200
From: Stefano Garzarella <sgarzare@...hat.com>
To: Arseniy Krasnov <AVKrasnov@...rdevices.ru>
Cc: Stefan Hajnoczi <stefanha@...hat.com>,
"Michael S. Tsirkin" <mst@...hat.com>,
Jason Wang <jasowang@...hat.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"virtualization@...ts.linux-foundation.org"
<virtualization@...ts.linux-foundation.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
kernel <kernel@...rdevices.ru>
Subject: Re: [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
Hi Arseniy,
On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
> INTRODUCTION
>
> Hello, this is experimental implementation of virtio vsock zerocopy
>receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>fill provided vma area with pages of virtio RX buffers. After received data was
>processed by user, pages must be freed by 'madvise()' call with MADV_DONTNEED
>flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
Sounds cool, but maybe we would need some socket/net experts here for
review.
Could we do something similar for the sending path as well?
>
> DETAILS
>
> Here is how mapping with mapped pages looks exactly: first page mapping
>contains array of trimmed virtio vsock packet headers (in contains only length
>of data on the corresponding page and 'flags' field):
>
> struct virtio_vsock_usr_hdr {
> uint32_t length;
> uint32_t flags;
> };
>
>Field 'length' allows user to know exact size of payload within each sequence
>of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>bounds or record bounds). All other pages are data pages from RX queue.
>
> Page 0 Page 1 Page N
>
> [ hdr1 .. hdrN ][ data ] .. [ data ]
> | | ^ ^
> | | | |
> | *-------------------*
> | |
> | |
> *----------------*
>
> Of course, single header could represent array of pages (when packet's
>buffer is bigger than one page).So here is example of detailed mapping layout
>for some set of packages. Lets consider that we have the following sequence of
>packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>be inserted to user's vma(vma is large enough).
>
> Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
> Page 1: [ 56 ]
> Page 2: [ 4096 ]
> Page 3: [ 4096 ]
> Page 4: [ 4096 ]
> Page 5: [ 8 ]
>
> Page 0 contains only array of headers:
> 'hdr0' has 56 in length field.
> 'hdr1' has 4096 in length field.
> 'hdr2' has 8200 in length field.
> 'hdr3' has 0 in length field(this is end of data marker).
>
> Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
> Page 2 corresponds to 'hdr1' and filled with data.
> Page 3 corresponds to 'hdr2' and filled with data.
> Page 4 corresponds to 'hdr2' and filled with data.
> Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>
> This patchset also changes packets allocation way: today implementation
>uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>"not large" could be bigger than one page). So to avoid this, data buffers now
>allocated using 'alloc_pages()' call.
>
> TESTS
>
> This patchset updates 'vsock_test' utility: two tests for new feature
>were added. First test covers invalid cases. Second checks valid transmission
>case.
Thanks for adding the test!
>
> BENCHMARKING
>
> For benchmakring I've added small utility 'rx_zerocopy'. It works in
>client/server mode. When client connects to server, server starts sending exact
>amount of data to client(amount is set as input argument).Client reads data and
>waits for next portion of it. Client works in two modes: copy and zero-copy. In
>copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>/'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>is better. For server, we can set size of tx buffer and for client we can set
>size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>is quiet simple:
>
>For client mode:
>
>./rx_zerocopy --mode client [--zerocopy] [--rx]
>
>For server mode:
>
>./rx_zerocopy --mode server [--mb] [--tx]
>
>[--mb] sets number of megabytes to transfer.
>[--rx] sets size of receive buffer/mapping in pages.
>[--tx] sets size of transmit buffer in pages.
>
>I checked for transmission of 4000mb of data. Here are some results:
>
> size of rx/tx buffers in pages
> *---------------------------------------------------*
> | 8 | 32 | 64 | 256 | 512 |
>*--------------*--------*----------*---------*----------*----------*
>| zerocopy | 24 | 10.6 | 12.2 | 23.6 | 21 | secs to
>*--------------*---------------------------------------------------- process
>| non-zerocopy | 13 | 16.4 | 24.7 | 27.2 | 23.9 | 4000 mb
>*--------------*----------------------------------------------------
>
>I think, that results are not so impressive, but at least it is not worse than
>copy mode and there is no need to allocate memory for processing date.
Why is it twice as slow in the first column?
>
> PROBLEMS
>
> Updated packet's allocation logic creates some problem: when host gets
>data from guest(in vhost-vsock), it allocates at least one page for each packet
>(even if packet has 1 byte payload). I think this could be resolved in several
>ways:
Can we somehow copy the incoming packets into the payload of the already
queued packet?
This reminds me that we have yet to fix a similar problem with kmalloc()
as well...
https://bugzilla.kernel.org/show_bug.cgi?id=215329
> 1) Make zerocopy rx mode disabled by default, so if user didn't enable
>it, current 'kmalloc()' way will be used.
That sounds reasonable to me, I guess also TCP needs a setsockopt() call
to enable the feature, right?
> 2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>in this case, we have mix of packets, allocated in two different ways thus
>during zerocopying to user(e.g. mapping pages to vma), such small packets will
>be handled in some stupid way: we need to allocate one page for user, copy data
>to it and then insert page to user's vma.
It seems more difficult to me, but at the same time doable. I would go
more on option 1, though.
>
>P.S: of course this is experimental RFC, so what do You think guys?
It seems cool :-)
But I would like some feedback from the net guys to have some TCP-like
things.
Thanks,
Stefano
Powered by blists - more mailing lists