Message-ID: <15cff5cd-52d5-68af-75c1-32be28137773@gmail.com>
Date: Tue, 5 Jul 2022 16:04:44 +0100
From: Pavel Begunkov <asml.silence@...il.com>
To: io-uring@...r.kernel.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Cc: "David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Jonathan Lemon <jonathan.lemon@...il.com>,
Willem de Bruijn <willemb@...gle.com>,
Jens Axboe <axboe@...nel.dk>, David Ahern <dsahern@...nel.org>,
kernel-team@...com
Subject: Re: [PATCH net-next v3 00/25] io_uring zerocopy send
On 7/5/22 16:01, Pavel Begunkov wrote:
NOTE: This is not to be picked directly due to cross-subsystem merge problems.
After finding a consensus and getting necessary acks, I'll work out merging
with Jakub and Jens.
> The patchset implements io_uring zerocopy send. It works with both registered
> and normal buffers, mixing is allowed but not recommended. Apart from usual
> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
> the userspace when buffers are freed and can be reused (see API design below),
> which is delivered into io_uring's Completion Queue. Those "buffer-free"
> notifications are not necessarily per request, but the userspace has control
> over it and should explicitly attach a number of requests to a single
> notification. The series also adds some internal optimisations when used with
> registered buffers like removing page referencing.
>
> From the kernel networking perspective there are two main changes. The first
> one is passing ubuf_info into the network layer from io_uring (inside of an
> in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
> caching on the io_uring side, but also helps to avoid cross-referencing
> and synchronisation problems. The second part is an optional optimisation
> removing page referencing for requests with registered buffers.
>
> Benchmarks use an optimised version of the selftest (see [1]), which in a
> loop sends a bunch of requests and then waits for their completions. The
> "+ flush" column posts one additional "buffer-free" notification per request,
> while plain "zc" doesn't post buffer notifications at all.
>
> NIC (requests / second):
> IO size | non-zc  | zc              | zc + flush
> 4000    | 495134  | 606420 (+22%)   | 558971 (+12%)
> 1500    | 551808  | 577116 (+4.5%)  | 565803 (+2.5%)
> 1000    | 584677  | 592088 (+1.2%)  | 560885 (-4%)
> 600     | 596292  | 598550 (+0.4%)  | 555366 (-6.7%)
>
> dummy (requests / second):
> IO size | non-zc  | zc              | zc + flush
> 8000    | 1299916 | 2396600 (+84%)  | 2224219 (+71%)
> 4000    | 1869230 | 2344146 (+25%)  | 2170069 (+16%)
> 1200    | 2071617 | 2361960 (+14%)  | 2203052 (+6%)
> 600     | 2106794 | 2381527 (+13%)  | 2195295 (+4%)
>
> Previously it also brought a massive performance speedup compared to the
> msg_zerocopy tool (see [3]), which is probably not super interesting.
>
> An additional bunch of refcounting optimisations was omitted from the series
> for simplicity, as they don't change the picture drastically. They will be
> sent as a follow-up, together with flushing optimisations closing the
> performance gap between the last two columns.
>
> Note: the series is based on net-next + for-5.20/io_uring, but as vanilla
> net-next fails for me, the repo (see [2]) is on top of for-5.20/io_uring.
>
> Links:
>
> liburing (benchmark + some tests):
> [1] https://github.com/isilence/liburing/tree/zc_v3
>
> kernel repo:
> [2] https://github.com/isilence/linux/tree/zc_v3
>
> RFC v1:
> [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/
>
> RFC v2:
> https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/
>
> API design overview:
>
> The series introduces an io_uring concept of notifiers. From the userspace
> perspective it's an entity to which it can bind one or more requests and then
> request to flush it. Flushing a notifier makes it impossible to attach new
> requests to it, and instructs the notifier to post a completion once all
> requests attached to it are completed and the kernel doesn't need the buffers
> anymore.
>
> Notifiers are stored in notification slots, which should be registered as
> an array in io_uring. Each slot stores only one notifier at any particular
> moment. Flushing removes it from the slot and the slot automatically replaces
> it with a new notifier. All operations with notifiers are done by specifying
> an index of a slot it's currently in.
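
The slot behaviour described above can be sketched as a small userspace model
in C. This is only a toy under stated assumptions: toy_slot, toy_notifier,
toy_attach, toy_flush and toy_request_done are hypothetical names and are not
the structures or functions from io_uring/notif.c, and handing the flushed
notifier back by value is a simplification made for the example. It only
mirrors the rules stated in the text: one notifier per slot, flushing detaches
it and installs a fresh one, and the completion is posted once the notifier is
flushed and all attached requests are done.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: hypothetical types, not the kernel's io_notif. */
struct toy_notifier {
	uint32_t seq;      /* which notifier of the slot this is (assumed to start at 0) */
	unsigned pending;  /* attached requests that have not completed yet */
	bool flushed;      /* once set, no new requests may be attached */
	bool posted;       /* the "buffer-free" completion has been posted */
};

struct toy_slot {
	struct toy_notifier cur;  /* a slot holds exactly one notifier at a time */
	uint32_t next_seq;        /* sequence number for the replacement notifier */
};

static void toy_slot_init(struct toy_slot *s)
{
	s->cur = (struct toy_notifier){ .seq = 0 };
	s->next_seq = 1;
}

/* Bind one request to the notifier currently sitting in the slot. */
static void toy_attach(struct toy_slot *s)
{
	assert(!s->cur.flushed);
	s->cur.pending++;
}

/*
 * Flush: take the notifier out of the slot and put a new one in its place.
 * If nothing is pending, the completion can be posted right away.
 */
static struct toy_notifier toy_flush(struct toy_slot *s)
{
	struct toy_notifier old = s->cur;

	old.flushed = true;
	old.posted = (old.pending == 0);
	s->cur = (struct toy_notifier){ .seq = s->next_seq++ };
	return old;
}

/*
 * One attached request completed and its buffers are no longer needed.
 * Returns true once the notifier's completion has been posted.
 */
static bool toy_request_done(struct toy_notifier *n)
{
	assert(n->pending > 0);
	n->pending--;
	if (n->flushed && n->pending == 0)
		n->posted = true;
	return n->posted;
}
```

In this model, attaching two requests, flushing, and then completing both
requests posts the completion only after the second toy_request_done() call,
while the slot already holds a new notifier that accepts further requests.
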
>
> When registering notification slots, the userspace specifies a u64 tag for each
> slot, which will be copied into notification completion entries as
> cqe::user_data. cqe::res is 0 and cqe::flags is equal to a wrap-around u32
> sequence number counting notifiers of a slot.
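
A rough illustration of how those completion fields could be filled and
consumed, again as a userspace sketch in C rather than the actual kernel code.
struct toy_cqe only mirrors the three fields named above (user_data, res,
flags), and toy_tagged_slot, toy_post_notification and toy_seq_distance are
hypothetical helpers. The starting value of the sequence counter is an
assumption. The point of the sketch is the wrap-around: comparing sequence
numbers with unsigned subtraction stays correct when the u32 counter overflows.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors only the fields mentioned in the text; not the uapi definition. */
struct toy_cqe {
	uint64_t user_data;  /* the per-slot tag given at registration */
	int32_t res;         /* always 0 for notification completions */
	uint32_t flags;      /* wrap-around sequence number of the notifier */
};

/* Hypothetical per-slot state: the registered tag and a running counter. */
struct toy_tagged_slot {
	uint64_t tag;
	uint32_t seq;
};

/* Build the completion for the slot's current notifier, then advance. */
static struct toy_cqe toy_post_notification(struct toy_tagged_slot *slot)
{
	struct toy_cqe cqe = {
		.user_data = slot->tag,
		.res = 0,
		.flags = slot->seq,
	};

	slot->seq++;  /* unsigned arithmetic: wraps to 0 after UINT32_MAX */
	return cqe;
}

/* Number of notifiers between two observed sequence numbers, wrap-safe. */
static uint32_t toy_seq_distance(uint32_t earlier, uint32_t later)
{
	return (uint32_t)(later - earlier);
}
```

A consumer that remembers the last sequence number it saw per tag could use
toy_seq_distance() to tell how many notifiers of that slot have completed since
then, even across the point where the counter wraps.
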
>
> Changelog:
>
> RFC v2 -> v3:
> mem accounting for non-registered buffers
> allow mixing registered and normal requests per notifier
> notification flushing via IORING_OP_RSRC_UPDATE
> TCP support
> fix buffer indexing
> fix io-wq ->uring_lock locking
> fix bugs when mixing with MSG_ZEROCOPY
> fix managed refs bugs in skbuff.c
>
> RFC -> RFC v2:
> remove additional overhead for non-zc from skb_release_data()
> avoid msg propagation, hide extra bits of non-zc overhead
> task_work based "buffer free" notifications
> improve io_uring's notification refcounting
> added 5/19, (no pfmemalloc tracking)
> added 8/19 and 9/19 preventing small copies with zc
> misc small changes
>
> Pavel Begunkov (25):
> ipv4: avoid partial copy for zc
> ipv6: avoid partial copy for zc
> skbuff: add SKBFL_DONT_ORPHAN flag
> skbuff: carry external ubuf_info in msghdr
> net: bvec specific path in zerocopy_sg_from_iter
> net: optimise bvec-based zc page referencing
> net: don't track pfmemalloc for managed frags
> skbuff: don't mix ubuf_info of different types
> ipv4/udp: support zc with managed data
> ipv6/udp: support zc with managed data
> tcp: support zc with managed data
> io_uring: add zc notification infrastructure
> io_uring: export task put
> io_uring: cache struct io_notif
> io_uring: complete notifiers in tw
> io_uring: add notification slot registration
> io_uring: wire send zc request type
> io_uring: account locked pages for non-fixed zc
> io_uring: allow to pass addr into sendzc
> io_uring: add rsrc referencing for notifiers
> io_uring: sendzc with fixed buffers
> io_uring: flush notifiers after sendzc
> io_uring: rename IORING_OP_FILES_UPDATE
> io_uring: add zc notification flush requests
> selftests/io_uring: test zerocopy send
>
> include/linux/io_uring_types.h | 37 ++
> include/linux/skbuff.h | 59 +-
> include/linux/socket.h | 7 +
> include/uapi/linux/io_uring.h | 43 +-
> io_uring/Makefile | 2 +-
> io_uring/io_uring.c | 40 +-
> io_uring/io_uring.h | 21 +
> io_uring/net.c | 134 ++++
> io_uring/net.h | 4 +
> io_uring/notif.c | 215 +++++++
> io_uring/notif.h | 87 +++
> io_uring/opdef.c | 24 +-
> io_uring/rsrc.c | 55 +-
> io_uring/rsrc.h | 16 +-
> io_uring/tctx.h | 26 -
> net/compat.c | 2 +
> net/core/datagram.c | 53 +-
> net/core/skbuff.c | 35 +-
> net/ipv4/ip_output.c | 63 +-
> net/ipv4/tcp.c | 52 +-
> net/ipv6/ip6_output.c | 62 +-
> net/socket.c | 6 +
> tools/testing/selftests/net/Makefile | 1 +
> .../selftests/net/io_uring_zerocopy_tx.c | 605 ++++++++++++++++++
> .../selftests/net/io_uring_zerocopy_tx.sh | 131 ++++
> 25 files changed, 1652 insertions(+), 128 deletions(-)
> create mode 100644 io_uring/notif.c
> create mode 100644 io_uring/notif.h
> create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
> create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh
>
--
Pavel Begunkov