Message-ID: <15cff5cd-52d5-68af-75c1-32be28137773@gmail.com>
Date:   Tue, 5 Jul 2022 16:04:44 +0100
From:   Pavel Begunkov <asml.silence@...il.com>
To:     io-uring@...r.kernel.org, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc:     "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Jens Axboe <axboe@...nel.dk>, David Ahern <dsahern@...nel.org>,
        kernel-team@...com
Subject: Re: [PATCH net-next v3 00/25] io_uring zerocopy send

On 7/5/22 16:01, Pavel Begunkov wrote:

NOTE: This is not to be picked directly due to cross-subsystem merge problems.
After finding a consensus and getting necessary acks, I'll work out merging
with Jakub and Jens.


> The patchset implements io_uring zerocopy send. It works with both registered
> and normal buffers, mixing is allowed but not recommended. Apart from usual
> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
> the userspace when buffers are freed and can be reused (see API design below),
> which is delivered into io_uring's Completion Queue. Those "buffer-free"
> notifications are not necessarily per request, but the userspace has control
> over it and should explicitly attach a number of requests to a single
> notification. The series also adds some internal optimisations when used with
> registered buffers like removing page referencing.
> 
> From the kernel networking perspective there are two main changes. The first
> one is passing ubuf_info into the network layer from io_uring (inside of an
> in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
> caching on the io_uring side, but also helps to avoid cross-referencing
> and synchronisation problems. The second part is an optional optimisation
> removing page referencing for requests with registered buffers.
> 
> Benchmarking was done with an optimised version of the selftest (see [1]),
> which sends a batch of requests in a loop and then waits for their completions.
> The "zc + flush" column posts one additional "buffer-free" notification per
> request, and plain "zc" doesn't post buffer notifications at all.
> 
> NIC (requests / second):
> IO size | non-zc    | zc             | zc + flush
> 4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
> 1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
> 1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
> 600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)
> 
> dummy (requests / second):
> IO size | non-zc    | zc             | zc + flush
> 8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
> 4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
> 1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
> 600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)
> 
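
The percentages in the tables above are relative changes against the non-zc
column. A minimal sketch of that arithmetic, using the NIC rows as quoted; the
function and variable names are made up for illustration and are not part of
the benchmark in [1]:

```python
def percent_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# (IO size, non-zc, zc, zc + flush) in requests / second,
# copied from the NIC table in the cover letter.
NIC_ROWS = [
    (4000, 495134, 606420, 558971),
    (1500, 551808, 577116, 565803),
    (1000, 584677, 592088, 560885),
    (600, 596292, 598550, 555366),
]

for size, non_zc, zc, zc_flush in NIC_ROWS:
    print(
        f"{size:>5} | zc {percent_change(non_zc, zc):+.1f}%"
        f" | zc + flush {percent_change(non_zc, zc_flush):+.1f}%"
    )
```

Small differences from the quoted figures can come from rounding or
truncation. The same helper applies unchanged to the dummy device rows.
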
> Previously it also brought a massive performance speedup compared to the
> msg_zerocopy tool (see [3]), which is probably not super interesting.
> 
> An additional bunch of refcounting optimisations was omitted from the series
> for simplicity, as they don't change the picture drastically. They will be
> sent as a follow-up, along with flushing optimisations closing the
> performance gap between the last two columns.
> 
> Note: the series is based on net-next + for-5.20/io_uring, but as vanilla
> net-next fails for me the repo (see [2]) is on top of for-5.20/io_uring.
> 
> Links:
> 
>    liburing (benchmark + some tests):
>    [1] https://github.com/isilence/liburing/tree/zc_v3
> 
>    kernel repo:
>    [2] https://github.com/isilence/linux/tree/zc_v3
> 
>    RFC v1:
>    [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/
> 
>    RFC v2:
>    https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/
> 
> API design overview:
> 
>    The series introduces an io_uring concept of notifiers. From the userspace
>    perspective it's an entity to which it can bind one or more requests and then
>    request to flush it. Flushing a notifier makes it impossible to attach new
>    requests to it, and instructs the notifier to post a completion once all
>    requests attached to it are completed and the kernel doesn't need the buffers
>    anymore.
> 
>    Notifiers are stored in notification slots, which should be registered as
>    an array in io_uring. Each slot stores only one notifier at any particular
>    moment. Flushing removes it from the slot and the slot automatically replaces
>    it with a new notifier. All operations with notifiers are done by specifying
>    an index of a slot it's currently in.
> 
>    When registering notification slots the userspace specifies a u64 tag for
>    each slot, which will be copied into notification completion entries as
>    cqe::user_data. cqe::res is 0 and cqe::flags is equal to a wrap-around u32
>    sequence number counting notifiers of a slot.
> 
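
The slot and notifier semantics described above can be sketched as a small
userspace model. This is a toy simulation only, not kernel code and not the
liburing API: the class and method names (Ring, Slot, Notifier, attach, flush,
buffer_released) are hypothetical. It also assumes that sequence numbers start
at 0 and that flushing a notifier with no attached requests posts its
completion immediately, neither of which is stated in the cover letter.

```python
U32_MASK = 0xFFFFFFFF


class Notifier:
    def __init__(self, tag, seq):
        self.tag = tag        # u64 tag inherited from the slot
        self.seq = seq        # u32 sequence number within the slot
        self.pending = 0      # attached requests whose buffers are still in use
        self.flushed = False  # once set, no new requests can reach this notifier


class Slot:
    def __init__(self, tag):
        self.tag = tag
        self.next_seq = 0     # assumption: numbering starts at 0
        self.notifier = self.new_notifier()

    def new_notifier(self):
        notifier = Notifier(self.tag, self.next_seq)
        self.next_seq = (self.next_seq + 1) & U32_MASK  # wrap-around u32
        return notifier


class Ring:
    def __init__(self, tags):
        self.slots = [Slot(tag) for tag in tags]
        self.cq = []          # posted entries as (user_data, res, flags)

    def attach(self, slot_idx):
        """Bind one request to the notifier currently sitting in the slot."""
        notifier = self.slots[slot_idx].notifier
        notifier.pending += 1
        return notifier

    def flush(self, slot_idx):
        """Detach the current notifier; the slot gets a fresh one."""
        slot = self.slots[slot_idx]
        old = slot.notifier
        old.flushed = True
        slot.notifier = slot.new_notifier()
        self._maybe_post(old)
        return old

    def buffer_released(self, notifier):
        """One attached request completed and its buffer is no longer needed."""
        notifier.pending -= 1
        self._maybe_post(notifier)

    def _maybe_post(self, notifier):
        if notifier.flushed and notifier.pending == 0:
            self.cq.append((notifier.tag, 0, notifier.seq))


ring = Ring(tags=[0xA0, 0xB0])
first = ring.attach(0)
ring.attach(0)                 # second request on the same notifier
ring.flush(0)                  # nothing posted yet, buffers still in use
ring.buffer_released(first)
ring.buffer_released(first)    # last buffer freed, completion is posted
print(ring.cq)
```

Because requests are attached by slot index rather than by notifier, anything
attached after the flush lands on the replacement notifier, which carries the
next sequence number.
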
> Changelog:
> 
>    RFC v2 -> v3:
>      mem accounting for non-registered buffers
>      allow mixing registered and normal requests per notifier
>      notification flushing via IORING_OP_RSRC_UPDATE
>      TCP support
>      fix buffer indexing
>      fix io-wq ->uring_lock locking
>      fix bugs when mixing with MSG_ZEROCOPY
>      fix managed refs bugs in skbuff.c
> 
>    RFC -> RFC v2:
>      remove additional overhead for non-zc from skb_release_data()
>      avoid msg propagation, hide extra bits of non-zc overhead
>      task_work based "buffer free" notifications
>      improve io_uring's notification refcounting
>      added 5/19, (no pfmemalloc tracking)
>      added 8/19 and 9/19 preventing small copies with zc
>      misc small changes
> 
> Pavel Begunkov (25):
>    ipv4: avoid partial copy for zc
>    ipv6: avoid partial copy for zc
>    skbuff: add SKBFL_DONT_ORPHAN flag
>    skbuff: carry external ubuf_info in msghdr
>    net: bvec specific path in zerocopy_sg_from_iter
>    net: optimise bvec-based zc page referencing
>    net: don't track pfmemalloc for managed frags
>    skbuff: don't mix ubuf_info of different types
>    ipv4/udp: support zc with managed data
>    ipv6/udp: support zc with managed data
>    tcp: support zc with managed data
>    io_uring: add zc notification infrastructure
>    io_uring: export task put
>    io_uring: cache struct io_notif
>    io_uring: complete notifiers in tw
>    io_uring: add notification slot registration
>    io_uring: wire send zc request type
>    io_uring: account locked pages for non-fixed zc
>    io_uring: allow to pass addr into sendzc
>    io_uring: add rsrc referencing for notifiers
>    io_uring: sendzc with fixed buffers
>    io_uring: flush notifiers after sendzc
>    io_uring: rename IORING_OP_FILES_UPDATE
>    io_uring: add zc notification flush requests
>    selftests/io_uring: test zerocopy send
> 
>   include/linux/io_uring_types.h                |  37 ++
>   include/linux/skbuff.h                        |  59 +-
>   include/linux/socket.h                        |   7 +
>   include/uapi/linux/io_uring.h                 |  43 +-
>   io_uring/Makefile                             |   2 +-
>   io_uring/io_uring.c                           |  40 +-
>   io_uring/io_uring.h                           |  21 +
>   io_uring/net.c                                | 134 ++++
>   io_uring/net.h                                |   4 +
>   io_uring/notif.c                              | 215 +++++++
>   io_uring/notif.h                              |  87 +++
>   io_uring/opdef.c                              |  24 +-
>   io_uring/rsrc.c                               |  55 +-
>   io_uring/rsrc.h                               |  16 +-
>   io_uring/tctx.h                               |  26 -
>   net/compat.c                                  |   2 +
>   net/core/datagram.c                           |  53 +-
>   net/core/skbuff.c                             |  35 +-
>   net/ipv4/ip_output.c                          |  63 +-
>   net/ipv4/tcp.c                                |  52 +-
>   net/ipv6/ip6_output.c                         |  62 +-
>   net/socket.c                                  |   6 +
>   tools/testing/selftests/net/Makefile          |   1 +
>   .../selftests/net/io_uring_zerocopy_tx.c      | 605 ++++++++++++++++++
>   .../selftests/net/io_uring_zerocopy_tx.sh     | 131 ++++
>   25 files changed, 1652 insertions(+), 128 deletions(-)
>   create mode 100644 io_uring/notif.c
>   create mode 100644 io_uring/notif.h
>   create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
>   create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh
> 

-- 
Pavel Begunkov
