Message-ID: <f9377852-773b-a2a2-3064-ee07ad5cc908@sberdevices.ru>
Date: Tue, 7 Jun 2022 10:26:29 +0000
From: Arseniy Krasnov <AVKrasnov@...rdevices.ru>
To: Stefano Garzarella <sgarzare@...hat.com>
CC: Stefan Hajnoczi <stefanha@...hat.com>,
"Michael S. Tsirkin" <mst@...hat.com>,
Jason Wang <jasowang@...hat.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"virtualization@...ts.linux-foundation.org"
<virtualization@...ts.linux-foundation.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
kernel <kernel@...rdevices.ru>
Subject: Re: [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
On 24.05.2022 10:32, Stefano Garzarella wrote:
> On Fri, May 20, 2022 at 11:09:11AM +0000, Arseniy Krasnov wrote:
>> Hello Stefano,
>>
>> On 19.05.2022 10:42, Stefano Garzarella wrote:
>>> On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>>>> Hello Stefano,
>>>>
>>>> On 17.05.2022 18:14, Stefano Garzarella wrote:
>>>>> Hi Arseniy,
>>>>>
>>>>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>>>> INTRODUCTION
>>>>>>
>>>>>> Hello, this is an experimental implementation of virtio vsock zerocopy
>>>>>> receive. It was inspired by the TCP zerocopy receive work by Eric Dumazet.
>>>>>> This API uses the same idea: call 'mmap()' on the socket's descriptor, then
>>>>>> each 'getsockopt()' fills the provided vma area with pages of virtio RX
>>>>>> buffers. After the received data has been processed by the user, the pages
>>>>>> must be freed by an 'madvise()' call with the MADV_DONTNEED flag set (if the
>>>>>> user doesn't call 'madvise()', the next 'getsockopt()' will fail).
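>>>>>>
>>>>>> For illustration, the userspace flow could look roughly like this (just a
>>>>>> sketch: the 'getsockopt()' option name and the mapping size below are
>>>>>> placeholders, not the real values used by the patchset):
>>>>>>
>>>>>>   #include <sys/mman.h>
>>>>>>   #include <sys/socket.h>
>>>>>>
>>>>>>   #define RX_MAP_SIZE                (16 * 4096)  /* placeholder mapping size */
>>>>>>   #define SO_VM_SOCKETS_ZEROCOPY_RX  100          /* placeholder option name  */
>>>>>>
>>>>>>   static int zerocopy_rx_loop(int fd)
>>>>>>   {
>>>>>>           socklen_t len = RX_MAP_SIZE;
>>>>>>           void *area;
>>>>>>
>>>>>>           /* map a window for RX pages over the socket descriptor */
>>>>>>           area = mmap(NULL, RX_MAP_SIZE, PROT_READ, MAP_SHARED, fd, 0);
>>>>>>           if (area == MAP_FAILED)
>>>>>>                   return -1;
>>>>>>
>>>>>>           for (;;) {
>>>>>>                   /* each getsockopt() fills the vma with RX buffer pages */
>>>>>>                   if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY_RX,
>>>>>>                                  area, &len) < 0)
>>>>>>                           break;
>>>>>>
>>>>>>                   /* ... process headers and data found in 'area' ... */
>>>>>>
>>>>>>                   /* return the pages, otherwise the next getsockopt() fails */
>>>>>>                   if (madvise(area, RX_MAP_SIZE, MADV_DONTNEED) < 0)
>>>>>>                           break;
>>>>>>           }
>>>>>>
>>>>>>           munmap(area, RX_MAP_SIZE);
>>>>>>           return 0;
>>>>>>   }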
>>>>>
>>>>> Sounds cool, but maybe we would need some socket/net experts here for review.
>>>>
>>>> Yes, that would be great
>>>>
>>>>>
>>>>> Could we do something similar for the sending path as well?
>>>>
>>>> Here are some thoughts about zerocopy transmission:
>>>>
>>>> I tried to implement this feature in the following way: the user creates
>>>> a page-aligned buffer, then during tx packet allocation, instead of
>>>> creating the data buffer with 'kmalloc()', I tried to add the user's
>>>> buffer to the virtio queue. But I found a problem: as the kernel virtio
>>>> API uses virtual addresses to add new buffers, deep inside the virtio
>>>> subsystem 'virt_to_phys()' is called to get the physical address of the
>>>> buffer, so the user's virtual address won't be translated correctly to a
>>>> physical address. In theory, I could perform a page walk for such a user
>>>> va, get the physical address and pass some "fake" virtual address to the
>>>> virtio API in order to make 'virt_to_phys()' return a valid physical
>>>> address, but I think this is ugly.
>>>
>>> And maybe we should also pin the pages to prevent them from being replaced.
>>>
>>> I think we should do something similar to what we do in vhost-vdpa.
>>> Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c
>>
>> Hm, ok. I'll read about vdpa...
>>
>>>
>>>>
>>>>
>>>> If we are talking about the 'mmap()' way, I think we can do the following:
>>>> the user calls 'mmap()' on the socket, and the kernel fills the newly
>>>> created mapping with allocated pages (all pages have rw permissions). Now
>>>> the user can use the pages of this mapping (e.g. fill them with data).
>>>> Finally, to start transmission, the user calls 'getsockopt()' or some
>>>> 'ioctl()' and the kernel processes the data of this mapping. Also, as this
>>>> call returns immediately (i.e. it is asynchronous), some completion logic
>>>> must be implemented. For example, use the same way as MSG_ZEROCOPY does -
>>>> poll the socket's error queue to get the message that the pages can be
>>>> reused - or don't allow the user to work with these pages at all: unmap
>>>> them, perform the transmission and finally free the pages. To start a new
>>>> transmission the user needs to call 'mmap()' again.
>>>>
>>>> OR
>>>>
>>>> I think there is another unusual way for zerocopy tx: let's use 'vmsplice()'
>>>> /'splice()'. In this approach, to transmit something, the user does the
>>>> following steps:
>>>> 1) Creates a pipe.
>>>> 2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe, inserting the data pages
>>>>    into it. SPLICE_F_GIFT allows the user to forget about the allocated
>>>>    pages - the kernel will free them.
>>>> 3) Calls 'splice(SPLICE_F_MOVE)' from the pipe to the socket. SPLICE_F_MOVE
>>>>    will move the pages from the pipe to the socket (i.e. in a special socket
>>>>    callback we get the set of the pipe's pages as an input argument and all
>>>>    pages will be inserted into the virtio queue).
>>>>
>>>> But as SPLICE_F_MOVE support is disabled, it must be repaired first (a rough
>>>> userspace sketch of this flow follows below).
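>>>>
>>>> For illustration, the userspace side of this flow could look roughly like
>>>> this (just a sketch: error paths and cleanup are simplified, 'pages' must
>>>> be a page-aligned buffer):
>>>>
>>>>   #define _GNU_SOURCE
>>>>   #include <fcntl.h>
>>>>   #include <sys/uio.h>
>>>>   #include <unistd.h>
>>>>
>>>>   static int splice_tx(int sock_fd, void *pages, size_t len)
>>>>   {
>>>>           int p[2];
>>>>           struct iovec iov = { .iov_base = pages, .iov_len = len };
>>>>
>>>>           if (pipe(p) < 0)                        /* 1) create pipe */
>>>>                   return -1;
>>>>
>>>>           /* 2) gift the pages to the pipe - the kernel now owns them */
>>>>           if (vmsplice(p[1], &iov, 1, SPLICE_F_GIFT) < 0)
>>>>                   return -1;
>>>>
>>>>           /* 3) move the pages from the pipe into the socket */
>>>>           if (splice(p[0], NULL, sock_fd, NULL, len, SPLICE_F_MOVE) < 0)
>>>>                   return -1;
>>>>
>>>>           close(p[0]);
>>>>           close(p[1]);
>>>>           return 0;
>>>>   }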
>>>
>>> Splice seems interesting, but it would be nice if we did something similar to TCP. IIUC they use a flag for send(2):
>>>
>>> send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
>>>
>>
>> Yes, but in this case I think:
>> 1) What is 'buf'? It can't be the user's address, since this buffer must be inserted
>> into the tx queue, i.e. it must be allocated by the kernel and then returned to the
>> user for tx purposes. In the TCP case, 'buf' is the user's address (page aligned, of
>> course) because the TCP logic uses sk_buff, which allows such memory to be used as
>> the data buffer.
>
> IIUC we can pin that buffer like we do in vhost-vdpa, and use it in the VQ.
Now I see how to use 'pin_user_pages()'. I'm going to implement zerocopy tx with the
same API as TCP MSG_ZEROCOPY.
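
Roughly, I expect the kernel side to pin the user's buffer like this (just a sketch,
prototypes as of ~v5.18, not the final patch code):

  #include <linux/mm.h>
  #include <linux/sched.h>

  static long vsock_pin_tx_buf(unsigned long uaddr, size_t len,
                               struct page **pages)
  {
          unsigned long nr = PAGE_ALIGN(len) >> PAGE_SHIFT;
          long pinned;

          mmap_read_lock(current->mm);
          pinned = pin_user_pages(uaddr, nr, FOLL_LONGTERM, pages, NULL);
          mmap_read_unlock(current->mm);

          if (pinned < 0)
                  return pinned;
          if (pinned != nr) {
                  /* partial pin: release what we got and fail */
                  unpin_user_pages(pages, pinned);
                  return -EFAULT;
          }

          /* ... build an sg list from 'pages' and add it to the tx virtqueue ... */
          return 0;
  }
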
>
>> 2) To wait until the tx process is done (i.e. the pages can be used again), such an
>> API (send + MSG_ZEROCOPY) uses the socket's error queue - the user polls it for
>> events indicating that tx is finished. So the same mechanism must be implemented
>> for virtio vsock.
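>>
>> For reference, the TCP MSG_ZEROCOPY flow on the user side looks roughly like
>> this (simplified, no error handling), and a vsock variant could mirror it:
>>
>>   #include <linux/errqueue.h>
>>   #include <poll.h>
>>   #include <sys/socket.h>
>>
>>   static int zerocopy_send(int fd, const void *buf, size_t len)
>>   {
>>           char ctrl[CMSG_SPACE(sizeof(struct sock_extended_err))];
>>           struct msghdr msg = { .msg_control = ctrl,
>>                                 .msg_controllen = sizeof(ctrl) };
>>           struct pollfd pfd = { .fd = fd };  /* POLLERR is always reported */
>>           int one = 1;
>>
>>           setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
>>
>>           if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
>>                   return -1;
>>
>>           /* a completion on the error queue means the pages may be reused */
>>           poll(&pfd, 1, -1);
>>           return recvmsg(fd, &msg, MSG_ERRQUEUE) < 0 ? -1 : 0;
>>   }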
>
> Yeah, I think so.
>
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> DETAILS
>>>>>>
>>>>>> Here is exactly how the mapping with mapped pages looks: the first page of
>>>>>> the mapping contains an array of trimmed virtio vsock packet headers (each
>>>>>> contains only the length of data on the corresponding page and a 'flags' field):
>>>>>>
>>>>>> struct virtio_vsock_usr_hdr {
>>>>>>         uint32_t length;
>>>>>>         uint32_t flags;
>>>>>> };
>>>>>>
>>>>>> The 'length' field lets the user know the exact size of the payload within each
>>>>>> sequence of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags (such as
>>>>>> message bounds or record bounds). All other pages are data pages from the RX queue.
>>>>>>
>>>>>>     Page 0      Page 1      Page N
>>>>>>
>>>>>> [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>>>     |      |       ^           ^
>>>>>>     |      |       |           |
>>>>>>     |      *-------------------*
>>>>>>     |              |
>>>>>>     *--------------*
>>>>>>
>>>>>> Of course, a single header could describe an array of pages (when the packet's
>>>>>> buffer is bigger than one page). So here is an example of the detailed mapping
>>>>>> layout for some set of packets. Let's consider that we have the following sequence
>>>>>> of packets: 56 bytes, 4096 bytes and 8200 bytes. All pages 0, 1, 2, 3, 4 and 5 will
>>>>>> be inserted into the user's vma (the vma is large enough).
>>>>>>
>>>>>> Page 0: [[ hdr0 ][ hdr1 ][ hdr2 ][ hdr3 ] ... ]
>>>>>> Page 1: [ 56 ]
>>>>>> Page 2: [ 4096 ]
>>>>>> Page 3: [ 4096 ]
>>>>>> Page 4: [ 4096 ]
>>>>>> Page 5: [ 8 ]
>>>>>>
>>>>>> Page 0 contains only the array of headers:
>>>>>> 'hdr0' has 56 in its length field.
>>>>>> 'hdr1' has 4096 in its length field.
>>>>>> 'hdr2' has 8200 in its length field.
>>>>>> 'hdr3' has 0 in its length field (this is the end-of-data marker).
>>>>>>
>>>>>> Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>>>>> Page 2 corresponds to 'hdr1' and filled with data.
>>>>>> Page 3 corresponds to 'hdr2' and filled with data.
>>>>>> Page 4 corresponds to 'hdr2' and filled with data.
>>>>>> Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
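>>>>>>
>>>>>> So the receiver can walk the mapping roughly like this (just a sketch:
>>>>>> PAGE_SIZE is assumed to be 4096 here, 'area' is the start of the mapping,
>>>>>> and process_data() is only a placeholder for the application's handler):
>>>>>>
>>>>>>   const struct virtio_vsock_usr_hdr *hdr = area;        /* page 0 */
>>>>>>   const char *data = (const char *)area + PAGE_SIZE;    /* page 1 */
>>>>>>
>>>>>>   for (; hdr->length != 0; hdr++) {          /* zero length == end marker */
>>>>>>           uint32_t left = hdr->length;
>>>>>>
>>>>>>           /* each packet occupies a whole number of data pages */
>>>>>>           while (left) {
>>>>>>                   uint32_t chunk = left < PAGE_SIZE ? left : PAGE_SIZE;
>>>>>>
>>>>>>                   process_data(data, chunk, hdr->flags);  /* placeholder */
>>>>>>                   data += PAGE_SIZE;
>>>>>>                   left -= chunk;
>>>>>>           }
>>>>>>   }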
>>>>>>
>>>>>> This patchset also changes the packet allocation scheme: today's implementation
>>>>>> uses only 'kmalloc()' to create the data buffer. The problem happens when we try
>>>>>> to map such buffers into the user's vma - the kernel forbids mapping slab pages
>>>>>> into a user's vma (pages of "not large" 'kmalloc()' allocations are marked with
>>>>>> the PageSlab flag, and "not large" can still be bigger than one page). So to avoid
>>>>>> this, data buffers are now allocated using the 'alloc_pages()' call.
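>>>>>>
>>>>>> Roughly, the allocation becomes something like this (sketch, not the exact
>>>>>> patch code):
>>>>>>
>>>>>>   /* alloc_pages() gives non-slab pages, so they can later be inserted
>>>>>>    * into the user's vma (e.g. with vm_insert_page()-style helpers) */
>>>>>>   struct page *pg = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(buf_len));
>>>>>>   void *va = pg ? page_address(pg) : NULL;   /* kernel virtual address */
>>>>>>
>>>>>>   /* ... use 'va'/'pg' as the packet data buffer ... */
>>>>>>
>>>>>>   __free_pages(pg, get_order(buf_len));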
>>>>>>
>>>>>> TESTS
>>>>>>
>>>>>> This patchset updates the 'vsock_test' utility: two tests for the new feature
>>>>>> were added. The first test covers invalid cases. The second checks the valid
>>>>>> transmission case.
>>>>>
>>>>> Thanks for adding the test!
>>>>>
>>>>>>
>>>>>> BENCHMARKING
>>>>>>
>>>>>> For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
>>>>>> client/server mode. When the client connects to the server, the server starts
>>>>>> sending an exact amount of data to the client (the amount is set as an input
>>>>>> argument). The client reads the data and waits for the next portion of it. The
>>>>>> client works in two modes: copy and zero-copy. In copy mode the client uses the
>>>>>> 'read()' call, while in zerocopy mode the sequence 'mmap()'/'getsockopt()'/
>>>>>> 'madvise()' is used. A smaller transmission time is better. For the server, we
>>>>>> can set the size of the tx buffer, and for the client we can set the size of the
>>>>>> rx buffer or the rx mapping size (in zerocopy mode). Usage of this utility is
>>>>>> quite simple:
>>>>>>
>>>>>> For client mode:
>>>>>>
>>>>>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>>>>
>>>>>> For server mode:
>>>>>>
>>>>>> ./rx_zerocopy --mode server [--mb] [--tx]
>>>>>>
>>>>>> [--mb] sets the number of megabytes to transfer.
>>>>>> [--rx] sets the size of the receive buffer/mapping in pages.
>>>>>> [--tx] sets the size of the transmit buffer in pages.
>>>>>>
>>>>>> I checked the transmission of 4000 MB of data. Here are some results:
>>>>>>
>>>>>>                           size of rx/tx buffers in pages
>>>>>>                *---------------------------------------------------*
>>>>>>                |    8   |    32    |    64   |   256    |   512    |
>>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>>> | zerocopy     |   24   |   10.6   |  12.2   |   23.6   |    21    |  secs to
>>>>>> *--------------*--------*----------*---------*----------*----------*  process
>>>>>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   |  4000 MB
>>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>>>
>>>>>> I think the results are not so impressive, but at least it is not worse than
>>>>>> copy mode, and there is no need to allocate memory for processing the data.
>>>>>
>>>>> Why is it twice as slow in the first column?
>>>>
>>>> Maybe this is because memory copying for small buffers is very fast... I'll
>>>> analyze it more deeply.
>>>
>>> Maybe I misunderstood, by small buffers here what do you mean?
>>>
>>> I thought 8 was the number of pages, so 32KB buffers.
>>
>> Yes, 8 is the size in pages. Anyway, I need to check it more deeply.
>
> Okay, thanks!
>
> Stefano
>