netdev - [RFC PATCH v3 00/11] virtio/vsock: experimental zerocopy receive

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <f60d7e94-795d-06fd-0321-6972533700c5@sberdevices.ru>
Date:   Sun, 6 Nov 2022 19:33:41 +0000
From:   Arseniy Krasnov <AVKrasnov@...rdevices.ru>
To:     Stefan Hajnoczi <stefanha@...hat.com>,
        Stefano Garzarella <sgarzare@...hat.com>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        Jason Wang <jasowang@...hat.com>,
        "David S. Miller" <davem@...emloft.net>,
        "edumazet@...gle.com" <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Krasnov Arseniy <oxffffaa@...il.com>
CC:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "virtualization@...ts.linux-foundation.org" 
        <virtualization@...ts.linux-foundation.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        kernel <kernel@...rdevices.ru>
Subject: [RFC PATCH v3 00/11] virtio/vsock: experimental zerocopy receive



                              INTRODUCTION

Hello,

	This is experimental patchset for virtio vsock zerocopy receive.
It was inspired by TCP zerocopy receive by Eric Dumazet. This API  uses
same idea:call 'mmap()' on socket's descriptor,then call 'getsockopt()'
to fill provided vma area with pages of virtio receive buffers.   After
received data was processed by user, pages must be freed by 'madvise()'
call with MADV_DONTNEED flag set(but if user will not call 'madvise()',
next 'getsockopt()' will fail).

                                 DETAILS

	Here is how mapping with mapped pages looks exactly: first page
contains information about mapped data buffers. At zero offset mapping
contains special data structure:

	struct virtio_vsock_usr_hdr_pref {
	       u32 poll_value;
	       u32 hdr_num;
	};

This structure contains two fields:
'poll_value' - shows that current socket has data to read.When socket's
intput queue is empty,  'poll_value' is set to 0 by kernel.  When input
queue has some data, 'poll_value' is set to 1 by kernel. When socket is
closed for data receive, 'poll_value' is ~0.This tells user that "there
will be no more data,continue to call 'getsockopt()' until you'll  find
'hdr_num' == 0".User spins on it in userspace, without calling 'poll()'
system call(of course, 'poll()' is still working).
'hdr_num' - shows number of mapped pages with data which starts from
second page of this mappined.

NOTE:
   This version has two limitations:

   1) One mapping per socket is supported.  It is implemented by adding
      'struct page*' pointer to  'struct virtio_vsock' structure (first
      page of mapping, which contains 'virtio_vsock_usr_hdr_pref').But,
      I think, support for multiple pages could be implemented by using
      something like hash table of such pages, or more simple, just use
      first page of mapping as headers page by default. Also I think,
      number of such pages may be controlled by 'setsockop()'.

   2) After 'mmap()' call,it is impossible to call 'mmap()' again, even
      after calling 'madvise()'/'munmap()' on the whole mapping.This is
      because socket can't handle 'munmap()' calls(as there is no such
      callback in 'proto_ops'),thus polling page exists until socket is
      opened.

After 'virtio_vsock_usr_hdr_pref' object,  first page contains array of
trimmed virtio vsock packet headers (in contains only length of data on
the corresponding page and 'flags' field):

	struct virtio_vsock_usr_hdr {
		uint32_t length;
		uint32_t flags;
	};

Field  'length'  allows user to know  exact size of payload within each
sequence of pages and field 'flags' allows  to  process SOCK_SEQPACKET
flags(such as message bounds or record bounds).All other pages are data
pages from virtio queue.

                Page 0        Page 1      Page N

	[ pref hdr0 .. hdrN ][ data ] .. [ data ]
                |        |       ^           ^
                |        |       |           |
                |        *-------|-----------*
                |                |
                *----------------*

	Of course, single header could represent array of pages (when
packet's buffer is bigger than one page).So here is example of detailed
mapping layout for some set of packages. Lets consider that we have the
following sequence of packages:56 bytes, 4096 bytes and 8200 bytes. All
pages: 0,1,2,3,4 and 5 will be inserted to user's vma.

	Page 0: [[ pref ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
	Page 1: [ 56 ]
	Page 2: [ 4096 ]
	Page 3: [ 4096 ]
	Page 4: [ 4096 ]
	Page 5: [ 8 ]

	Page 0 contains only array of headers:
	'pref' is 'struct virtio_vsock_usr_hdr_pref'.
	'hdr0' has 56 in length field.
	'hdr1' has 4096 in length field.
	'hdr2' has 8200 in length field.
	'hdr3' has 0 in length field(this is end of data marker).

	Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
	Page 2 corresponds to 'hdr1' and filled with data.
	Page 3 corresponds to 'hdr2' and filled with data.
	Page 4 corresponds to 'hdr2' and filled with data.
	Page 5 corresponds to 'hdr2' and has only 8 bytes of data.

        pref will be the following: poll_value = 1, hdr_num = 5

	This patchset also changes packets allocation way: current uses
only 'kmalloc()' to create data buffer.  Problem happens when we try to
map such buffers  to user's vma - kernel  restricts to map  slab pages
to user's vma(as pages of "not large" 'kmalloc()' allocations have flag
PageSlab set and "not large" could be bigger than one page).So to avoid
this, data buffers now allocated using 'alloc_pages()' call.

                             DIFFERENCE WITH TCP

	As this feature uses same approach as for TCP protocol,here are
some difference between both version(from user's POV):

1) For 'getsockopt()':
   - This version passes only address of mapping.
   - TCP passes special structure to 'getsockopt()'. In addition to the
     address of mapping in contains 'length' and 'recv_skip_hint'.First
     means size of data inside mapping(out param, set by kernel).Second
     has bool type, if it is true, then user must dequeue rest of  data
     using 'read()' syscall(e.g. it is out parameter also).
2) Mapping structure:
   - This version uses first page of mapping for meta data and rest of
     pages for data.
   - TCP version uses whole mapping for data only.
3) Data layout:
   - This version inserts virtio buffers to mapping, so each buffer may
     be filled partially. To get size of payload in every buffer, first
     mapping's page must be used(see 2).
   - TCP version inserts pages of single skb.

*Please, correct me if I made some mistake in TCP zerocopy description.

                                TESTS

	This patchset updates 'vsock_test' utility: two tests for new
feature were added. First test covers invalid cases.Second checks valid
transmission case.

                            BENCHMARKING

	For benchmakring I've created small test utility 'vsock_rx_perf'.
It works in client/server mode.  When client connects to server, server
starts sending specified amount of data to client(size is set as input
argument). Client reads data and waits for next portion of it. In client
mode, dequeue logic works in three modes: copy, zerocopy and zerocopy
with user polling.

1) In copy mode client uses 'read()' system call.
2) In zerocopy mode client uses 'mmap()'/'getsockopt()' to dequeue data
   and 'poll()' to wait data.
3) In zerocopy mode + user polling client uses 'mmap()'/'getsockopt()',
   but to wait data it polls shared page(e.g. busyloop).

Here is usage:
-c <cid>	Peer CID to connect to(if run in client mode).
-m <megabytes>	Number of megabytes to send.
-b <bytes>	Size of RX/TX buffer(or mapping) in pages.
-r <bytes>	SO_RCVLOWAT value in bytes(if run in client mode).
-v <bytes>	peer credit.
-s		Run as server.
-z [n|y|u]	Dequeue mode.
		n - copy mode. 1) above.
		y - zero copy mode. 2) above.
		u - zero copy mode + user poll. 3) above.

Utility produces the following output:
1) In server mode it prints number of sec spent for whole tx loop.
2) In client mode it prints several values:
   * Number of sec, spent for whole rx loop(including 'poll()').
   * Number of sec, spend in dequeue system calls:
     In case of '-z n' it will be time in 'read()'.
     In case of '-z y|u' it will be time in 'getsockopt()' + 'madvise()'.
   * Number of wake ups with POLLIN flag set(except '-z u' mode).
   * Average time(ns) in single dequeue iteration(e.g. divide second
     value by third).

Idea of test is to compare zerocopy approach and classic copy, as it is
clear, that to dequeue some "small" amount of data, copy must be better,
because syscall with 'memcpy()' for 1 byte(for example) is just nothing
against two system calls, where first must map at least one page, while
second will unmap it.

Test was performed with the following settings:
1) Total size of data to send is 2G(-m argument).

2) Peer's buffer size is changed to 2G(-v argument) - this is needed to
   avoid stalls of sender to wait for enough credit.

3) Both buffer size(-b) and SO_RCVLOWAT(-r) are used to control number
   of bytes to dequeue in single loop iteration. Buffer size limits max
   number of bytes to read, while SO_RCVLOWAT won't allow user to get
   too small number of bytes.

4) For sender, tx buffer(which is passed to 'write()') size is 16Mb. Of
   course we can set it to peer's buffer size and as we are in STREAM
   mode it leads to 'write()' will be called once.

Deignations here and below:
H2G - host to guest transmission. Server is host, client is guest.
G2H - guest to host transmission. Server is guest, client is host.
C   - copy mode.
ZC  - zerocopy mode.
ZU  - zerocopy with user poll mode. This mode is removed from test at
      this moment, because I need to support SO_RCVLOWAT logic in it.

So, rows corresponds to dequeue mode, while columns show number of bytes
to dequeue in each mode. Each cell contains several values in the next
format:
*------------*
|   A / B    |
|     C      |
|     D      |
*------------*

A - number of seconds which server spent in tx loop.
B - number of seconds which client spent in rx loop.
C - number of seconds which client spent in rx loop, but except 'poll()'
    system call(e.g. only in dequeue system calls).
D - Average number of ns for each POLLIN wake up(in other words
    it is average value for C).

G2H:

            #0        #1        #2        #3        #4        #5
  *----*---------*---------*---------*---------*---------*---------*
  |    |         |         |         |         |         |         |
  |    |   4Kb   |   16Kb  |   64Kb  |  128Kb  |  256Kb  |  512Kb  |
  |    |         |         |         |         |         |         |
  *----*---------*---------*---------*---------*---------*---------*
  |    | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35|
  |    |  7039   |  15074  |  34975  |  89938  |  162384 |  438278 |
  *----*---------*---------*---------*---------*---------*---------*
  |    |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46|
  |    |  13598  |  15821  |  29574  |  43265  |  71771  |  150927 |
  *----*---------*---------*---------*---------*---------*---------*

H2G:

            #0        #1        #2        #3        #4        #5
  *----*---------*---------*---------*---------*---------*---------*
  |    |         |         |         |         |         |         |
  |    |   4Kb   |   16Kb  |   64Kb  |  128Kb  |  256Kb  |  512Kb  |
  |    |         |         |         |         |         |         |
  *----*---------*---------*---------*---------*---------*---------*
  |    | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00|
  |    |  17145  |  24172  |  72650  |  143496 |  295960 |  674146 |
  *----*---------*---------*---------*---------*---------*---------*
  |    |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35|
  |    |  41855  |  46339  |  71988  |  106000 |  153064 |  242036 |
  *----*---------*---------*---------*---------*---------*---------*

Here are my thoughts about these numbers(most obvious):

1) Let's check C and D values. We see, that zerocopy dequeue is faster
   on big buffers(in G2H it starts from 64Kb, in H2g - from 128Kb). I
   think this is main result of this test(at this moment), that shows
   performance difference between copy and zerocopy).

2) In G2H mode both server and client spend almost same time in rx/tx
   loops(see A / B in G2H table) - it looks good. In H2G mode, there is
   significant difference between server and client. I think  there are
   some side effects which produces such effect(continue to analyze).

3) Let's check C value. We can see, that G2H is always faster that H2G.
   In both copy and zerocopy mode.

4) Another interesting thing could be seen for example in H2G table,
   row #0, col #4 (case for 256Kb). Number of seconds in zerocopy mode
   is smaller than in copy mode(1.25 vs 2.42), but whole rx loop was
   faster in copy mode(5 seconds vs 5.35 seconds). E.g. if we account
   time spent in 'poll()', copy mode looks faster(even it spends more
   time in 'read()' than zerocopy loop in 'getsockopt()' + 'madvise()').
   I think, it is also not obvious effect.

So, according 1), it is better to use zerocopy, if You need to process
big buffers, with small rx waitings(for example it nay be video stream).
In other cases - it is better to use classic copy way, as it will be
more lightweight.

All tests were performed on x86 board with 4-core Celeron N2930 CPU(of
course it is, not a mainframe, but better than test with nested guest)
and 8Gb of RAM.

Anyway, this is not final version, and I will continue to improve both
kernel logic and performance tests.

                           SUGGESTIONS

1) I'm also working on MSG_ZEROCOPY support for virtio/vsock. May be I
   can merge both patches into single one?
2) This version works with single page headers. May be I can add new
   'getsockopt()' feature to allow to use multiple pages for headers.

                           CHANGE LOG
v1 -> v2:
 1) Zerocopy receive mode must  be enabled/disabled (off by default). I
    did not use generic SO_ZEROCOPY flag,  because in virtio-vsock case
    this feature depends on transport support. Instead of SO_ZEROCOPY,
    AF_VSOCK layer flag was added:SO_VM_SOCKETS_ZEROCOPY,while previous
    meaning of SO_VM_SOCKETS_ZEROCOPY(insert receive buffers to user's
    vm area) now renamed to SO_VM_SOCKETS_MAP_RX.
 2) Packet header which is exported to user now gets new extra field:
    'copy_len'.  This field handles special case: user reads data from
    socket in non zerocopy way(with disabled zerocopy) and then enables
    zerocopy.
    In this case, vhost part will  switch  buffer allocation logic from
    'kmalloc()' to direct calls for buddy allocator. But,there could be
    some pending 'kmalloc()' allocated packets in socket's rx list, and
    then user tries to read such packets in zerocopy way, dequeue will
    fail, because SLAB pages could not be inserted to user's vm area.So
    when such packet is found during zerocopy dequeue,dequeue loop will
    break and 'copy_len' will show size of such "bad" packet.After user
    detects this case,it must use 'read()/recv()' calls to dequeue such
    packet.
 3) Also may be move this features under config option? <<<<< DECLINED

v2 -> v3:
 1) Zerocopy  could be enabled only for socket in SS_UNCONNECTED state,
    so 'copy_len' field from v2 was removed.
 2) Mapping layout was updated.  First page of mapping now contains the
    following structure: 'struct virtio_vsock_usr_hdr_pref'    starting
    from the first byte. Then 'struct virtio_vsock_usr_hdr'  are placed
    in array.
 3) Transport get/set callbacks for zerocopy were removed, now flag to
    check zerocopy receive on/off is storead in 'vsock_sock'.
 4) For 'virtio_transport_recv_pkt()' interface changed. This was done,
    because vhost driver needs to check whether zerocopy is enabled on
    socket or not.So socket lookup is performed until packet allocation
    and socket structure is passed to this function.

     void virtio_transport_recv_pkt(struct virtio_transport *,
                                    struct virtio_vsock_pkt *);

     changed to

     void virtio_transport_recv_pkt(struct virtio_transport *,
                                    struct sock *,
                                    struct virtio_vsock_pkt *);

    If 'struct sock *' argument is NULL, this function works as before,
    otherwise it skips socket lookup, using input 'sock' as destination
    socket.

 4) Test for userspace polling was added.
 5) Zerocopy tests were moved to dedicated '.c' file.
 6) More benchmark results.
 7) Two tests updated: message bound test reworked and test for big
    message transmission was added.

Arseniy Krasnov(11):
 virtio/vsock: rework packet allocation logic
 virtio/vsock: update 'virtio_transport_recv_pkt()'
 af_vsock: add zerocopy receive logic
 virtio/vsock: add transport zerocopy callback
 vhost/vsock: switch packet's buffer allocation
 vhost/vsock: enable zerocopy callback
 virtio/vsock: enable zerocopy callback
 test/vsock: rework message bound test
 test/vsock: add big message test
 test/vsock: add receive zerocopy tests
 test/vsock: vsock_rx_perf utility

 drivers/vhost/vsock.c                     |  58 ++-
 include/linux/virtio_vsock.h              |   8 +
 include/net/af_vsock.h                    |   8 +
 include/uapi/linux/virtio_vsock.h         |  14 +
 include/uapi/linux/vm_sockets.h           |   3 +
 net/vmw_vsock/af_vsock.c                  | 187 ++++++++-
 net/vmw_vsock/virtio_transport.c          |   4 +-
 net/vmw_vsock/virtio_transport_common.c   | 256 ++++++++++++-
 net/vmw_vsock/vsock_loopback.c            |   2 +-
 tools/include/uapi/linux/virtio_vsock.h   |  15 +
 tools/include/uapi/linux/vm_sockets.h     |   8 +
 tools/testing/vsock/Makefile              |   3 +-
 tools/testing/vsock/control.c             |  34 ++
 tools/testing/vsock/control.h             |   2 +
 tools/testing/vsock/util.c                |  40 +-
 tools/testing/vsock/util.h                |   5 +
 tools/testing/vsock/vsock_rx_perf.c       | 604 ++++++++++++++++++++++++++++++
 tools/testing/vsock/vsock_test.c          | 198 +++++++++-
 tools/testing/vsock/vsock_test_zerocopy.c | 530 ++++++++++++++++++++++++++
 tools/testing/vsock/vsock_test_zerocopy.h |  14 +
 20 files changed, 1953 insertions(+), 40 deletions(-)

-- 
2.35.0