Message-ID: <20220608155941.GA34797@liuzhao-OptiPlex-7080>
Date: Wed, 8 Jun 2022 23:59:41 +0800
From: Zhao Liu <zhao1.liu@...ux.intel.com>
To: Arseniy Krasnov <AVKrasnov@...rdevices.ru>
Cc: "linux-kernel-AT-vger.kernel.org" <linux-kernel@...r.kernel.org>,
"kvm-AT-vger.kernel.org" <kvm@...r.kernel.org>,
"virtualization-AT-lists.linux-foundation.org"
<virtualization@...ts.linux-foundation.org>,
"netdev-AT-vger.kernel.org" <netdev@...r.kernel.org>,
kernel <kernel@...rdevices.ru>,
Krasnov Arseniy <oxffffaa@...il.com>,
Arseniy Krasnov <AVKrasnov@...rdevices.ru>,
Zhao Liu <zhao1.liu@...ux.intel.com>
Subject: Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
> Date: Fri, 3 Jun 2022 05:27:56 +0000
> From: Arseniy Krasnov <AVKrasnov@...rdevices.ru>
> Subject: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
>
> INTRODUCTION
>
> 	Hello, this is an experimental implementation of virtio vsock zerocopy
> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
> the same idea: call 'mmap()' on the socket's descriptor, then every
> 'getsockopt()' will fill the provided vma area with pages of virtio RX buffers.
> After the received data has been processed by the user, the pages must be freed
> by an 'madvise()' call with the MADV_DONTNEED flag set (if the user doesn't
> call 'madvise()', the next 'getsockopt()' will fail).
>
> DETAILS
>
> 	Here is exactly how the mapping looks: the first page of the mapping
> contains an array of trimmed virtio vsock packet headers (each contains only
> the length of the data on the corresponding page(s) and a 'flags' field):
>
> struct virtio_vsock_usr_hdr {
> uint32_t length;
> uint32_t flags;
> uint32_t copy_len;
> };
>
> The 'length' field lets the user know the exact size of the payload within
> each sequence of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags
> (such as message bounds or record bounds). The 'copy_len' field is described
> below in the 'v1 -> v2' section. All other pages are data pages from the RX
> queue.
>
> Page 0 Page 1 Page N
>
> [ hdr1 .. hdrN ][ data ] .. [ data ]
> | | ^ ^
> | | | |
> | *-------------------*
> | |
> | |
> *----------------*
>
> 	Of course, a single header could represent an array of pages (when a
> packet's buffer is bigger than one page). So here is an example of the
> detailed mapping layout for some set of packets. Let's consider that we have
> the following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes. All
> pages 0,1,2,3,4 and 5 will be inserted into the user's vma (the vma is large
> enough).
>
> Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
Hi Arseniy, what about adding a general `header` for the `virtio_vsock_usr_hdr`
array in Page 0?
Page 0 could then look like this:
Page 0: [[ header for hdrs ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
We can store the number of headers/pages in this first general header:
struct virtio_vsock_general_hdr {
	uint32_t usr_hdr_num;
};
This usr_hdr_num represents how many headers are used here. At most 256 pages
will be used here, and this kind of statistical information is useful.
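A quick sketch of that proposed page-0 layout and its capacity (4 KiB page
size assumed; 'usr_hdrs_per_page' is just an illustrative helper name, not
something from the patchset):

```c
#include <stddef.h>
#include <stdint.h>

#define ZC_PAGE_SIZE 4096

/* Proposed leading header at the start of page 0. */
struct virtio_vsock_general_hdr {
	uint32_t usr_hdr_num; /* number of valid usr_hdrs that follow */
};

/* Trimmed per-packet header, as in the cover letter. */
struct virtio_vsock_usr_hdr {
	uint32_t length;
	uint32_t flags;
	uint32_t copy_len;
};

/* How many usr_hdr entries still fit in page 0 after the general header. */
static size_t usr_hdrs_per_page(void)
{
	return (ZC_PAGE_SIZE - sizeof(struct virtio_vsock_general_hdr)) /
	       sizeof(struct virtio_vsock_usr_hdr);
}
```

With these sizes, (4096 - 4) / 12 entries fit, so the 4-byte general header
still leaves room for far more than the 256 headers mentioned above.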
> Page 1: [ 56 ]
> Page 2: [ 4096 ]
> Page 3: [ 4096 ]
> Page 4: [ 4096 ]
> Page 5: [ 8 ]
>
> Page 0 contains only the array of headers:
> 'hdr0' has 56 in its length field.
> 'hdr1' has 4096 in its length field.
> 'hdr2' has 8200 in its length field.
> 'hdr3' has 0 in its length field (this is the end-of-data marker).
>
> Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
> Page 2 corresponds to 'hdr1' and filled with data.
> Page 3 corresponds to 'hdr2' and filled with data.
> Page 4 corresponds to 'hdr2' and filled with data.
> Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
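
The page accounting in this example can be sketched with a small helper
(only the 4 KiB page arithmetic is assumed; 'mapping_pages' and
'hdr_data_pages' are hypothetical names, not from the patchset):

```c
#include <stddef.h>
#include <stdint.h>

#define ZC_PAGE_SIZE 4096u

/* Number of data pages a single usr_hdr's payload occupies. */
static size_t hdr_data_pages(uint32_t length)
{
	return (length + ZC_PAGE_SIZE - 1) / ZC_PAGE_SIZE;
}

/*
 * Total mapping size in pages for a header array terminated by a
 * zero length: one page for the header array itself (page 0) plus
 * the data pages behind every header.
 */
static size_t mapping_pages(const uint32_t *lengths)
{
	size_t pages = 1; /* page 0: the header array */

	for (; *lengths != 0; lengths++)
		pages += hdr_data_pages(*lengths);
	return pages;
}
```

For the 56/4096/8200 sequence above this yields 1 + 1 + 1 + 3 = 6 pages,
matching pages 0 through 5 in the layout.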
>
> 	This patchset also changes the packet allocation scheme: the current
> implementation uses only 'kmalloc()' to create data buffers. A problem happens
> when we try to map such buffers into the user's vma: the kernel forbids
> mapping slab pages into a user's vma (as pages of "not large" 'kmalloc()'
> allocations are marked with the PageSlab flag, and "not large" could be bigger
> than one page). So to avoid this, data buffers are now allocated using
> 'alloc_pages()'.
>
> TESTS
>
> 	This patchset updates the 'vsock_test' utility: two tests for the new
> feature were added. The first test covers invalid cases. The second checks a
> valid transmission case.
>
> BENCHMARKING
>
> 	For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
> client/server mode. When a client connects to the server, the server starts
> sending an exact amount of data to the client (the amount is set as an input
> argument). The client reads the data and waits for the next portion of it. The
> client works in two modes: copy and zero-copy. In copy mode the client uses
> the 'read()' call, while in zerocopy mode the sequence of 'mmap()'/
> 'getsockopt()'/'madvise()' is used. A smaller transmission time is better. For
> the server we can set the size of the tx buffer, and for the client we can set
> the size of the rx buffer or rx mapping size (in zerocopy mode). Usage of this
> utility is quite simple:
>
> For client mode:
>
> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>
> For server mode:
>
> ./rx_zerocopy --mode server [--mb] [--tx]
>
> [--mb] sets number of megabytes to transfer.
> [--rx] sets size of receive buffer/mapping in pages.
> [--tx] sets size of transmit buffer in pages.
>
> 	I checked transmission of 4000 MB of data. Here are some results:
>
>                         size of rx/tx buffers in pages
>                *---------------------------------------------------*
>                |    8   |    32    |   64    |   256    |   512    |
> *--------------*--------*----------*---------*----------*----------*
> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
> *--------------*--------*----------*---------*----------*----------* process
> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 MB
> *--------------*--------*----------*---------*----------*----------*
>
> The result in the first column (where non-zerocopy works better than zerocopy)
> happens because the time spent in the 'read()' system call is smaller than the
> time in 'getsockopt' + 'madvise'. I've verified that.
> 
> I think the results are not so impressive, but at least it is not worse than
> copy mode, and there is no need to allocate memory for processing the data.
>
> PROBLEMS
>
> 	The updated packet allocation logic creates a problem: when the host
> gets data from the guest (in vhost-vsock), it allocates at least one page for
> each packet (even if the packet has a 1 byte payload). I think this could be
> resolved in several ways:
> 	1) Make zerocopy rx mode disabled by default, so if the user didn't
> enable it, the current 'kmalloc()' way will be used. <<<<<<< (IMPLEMENTED IN V2)
> 	2) Use 'kmalloc()' for "small" packets, else call the page allocator.
> But in this case we have a mix of packets allocated in two different ways, so
> during zerocopying to the user (e.g. mapping pages into the vma), such small
> packets have to be handled in a clumsy way: we need to allocate one page for
> the user, copy the data to it and then insert the page into the user's vma.
>
> v1 -> v2:
>  1) Zerocopy receive mode can be enabled/disabled (disabled by default). I
>     didn't use the generic SO_ZEROCOPY flag, because in the virtio-vsock case
>     this feature depends on transport support. Instead of SO_ZEROCOPY, an
>     AF_VSOCK layer flag was added: SO_VM_SOCKETS_ZEROCOPY, while the previous
>     meaning of SO_VM_SOCKETS_ZEROCOPY (insert receive buffers into the user's
>     vm area) is now renamed to SO_VM_SOCKETS_MAP_RX.
>  2) The packet header which is exported to user space now gets a new field:
>     'copy_len'. This field handles a special case: the user reads data from
>     the socket in a non-zerocopy way (with zerocopy disabled) and then enables
>     the zerocopy feature. In this case the vhost part will switch the data
>     buffer allocation logic from 'kmalloc()' to direct calls to the buddy
>     allocator. But there could be some pending 'kmalloc()'-allocated packets
>     in the socket's rx list, and if the user then tries to read such packets
>     in a zerocopy way, the dequeue will fail, because SLAB pages cannot be
>     inserted into the user's vm area. So when such a packet is found during
>     zerocopy dequeue, the dequeue loop will break and 'copy_len' will show the
>     size of such a "bad" packet. After the user detects this case, it must use
>     'read()/recv()' calls to dequeue that packet.
>  3) Also, maybe move this feature under a config option?
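
The 'copy_len' handshake described in item 2 could be consumed in user space
roughly like this (a sketch: 'zc_needs_copy_fallback' is a hypothetical helper
name; only the copy_len semantics come from the mail):

```c
#include <stdbool.h>
#include <stdint.h>

/* Trimmed per-packet header exported in page 0 of the mapping. */
struct virtio_vsock_usr_hdr {
	uint32_t length;
	uint32_t flags;
	uint32_t copy_len;
};

/*
 * After a zerocopy dequeue, a non-zero copy_len marks the point where the
 * kernel hit a pending kmalloc()-allocated ("bad") packet that cannot be
 * mapped into the vma. The caller must fall back to read()/recv() for
 * exactly that many bytes before retrying the zerocopy path.
 */
static bool zc_needs_copy_fallback(const struct virtio_vsock_usr_hdr *hdr,
				   uint32_t *fallback_bytes)
{
	if (hdr->copy_len == 0)
		return false;
	*fallback_bytes = hdr->copy_len;
	return true;
}
```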
>
> Arseniy Krasnov (8):
> virtio/vsock: rework packet allocation logic
> vhost/vsock: rework packet allocation logic
> af_vsock: add zerocopy receive logic
> virtio/vsock: add transport zerocopy callback
> vhost/vsock: enable zerocopy callback
> virtio/vsock: enable zerocopy callback
> test/vsock: add receive zerocopy tests
> test/vsock: vsock rx zerocopy utility
>
> drivers/vhost/vsock.c | 121 +++++++++--
> include/linux/virtio_vsock.h | 5 +
> include/net/af_vsock.h | 7 +
> include/uapi/linux/virtio_vsock.h | 6 +
> include/uapi/linux/vm_sockets.h | 3 +
> net/vmw_vsock/af_vsock.c | 100 +++++++++
> net/vmw_vsock/virtio_transport.c | 51 ++++-
> net/vmw_vsock/virtio_transport_common.c | 211 ++++++++++++++++++-
> tools/include/uapi/linux/virtio_vsock.h | 11 +
> tools/include/uapi/linux/vm_sockets.h | 8 +
> tools/testing/vsock/Makefile | 1 +
> tools/testing/vsock/control.c | 34 +++
> tools/testing/vsock/control.h | 2 +
> tools/testing/vsock/rx_zerocopy.c | 356 ++++++++++++++++++++++++++++++++
> tools/testing/vsock/vsock_test.c | 295 ++++++++++++++++++++++++++
> 15 files changed, 1196 insertions(+), 15 deletions(-)
>
> --
> 2.25.1