[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+FuTSeP-W-ePV1EkWMmD4Ycsfq9viYdtyfDbUW3LXTc2q+BHQ@mail.gmail.com>
Date: Wed, 1 Dec 2021 19:36:22 -0500
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Pavel Begunkov <asml.silence@...il.com>
Cc: io-uring@...r.kernel.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
Jonathan Lemon <jonathan.lemon@...il.com>,
"David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
David Ahern <dsahern@...nel.org>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [RFC 00/12] io_uring zerocopy send
> >>> 1) we pass a bvec, so no page table walks.
> >>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
> >>> still doing page get/put (see 4/12) slashed 4-5%.
> >>> 3) avoiding get_page/put_page in 5/12
> >>> 4) completion events are posted into io_uring's CQ, so no
> >>> extra recvmsg for getting events
> >>> 5) no poll(2) in the code because of io_uring
> >>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
> >>> io_uring caches the structures reducing it to nearly zero-overhead.
> >>
> >> Nice set of complementary optimizations.
> >>
> >> We have looked at adding some of those as independent additions to
> >> msg_zerocopy before, such as long-term pinned regions. One issue with
> >> that is that the pages must remain until the request completes,
> >> regardless of whether the calling process is alive. So it cannot rely
> >> on a pinned range held by a process only.
> >>
> >> If feasible, it would be preferable if the optimizations can be added
> >> to msg_zerocopy directly, rather than adding a dependency on io_uring
> >> to make use of them. But not sure how feasible that is. For some, like
> >> 4 and 5, the answer is clearly it isn't. 6, it probably is?
>
> Forgot about 6), io_uring uses the fact that submissions are
> done under an per ring mutex, and completions are under a per
> ring spinlock, so there are two lists for them and no extra
> locking. Lists are spliced in a batched manner, so it's
> 1 spinlock per N (e.g. 32) cached ubuf_info's allocations.
>
> Any similar guarantees for sockets?
For datagrams it might matter, not sure if it would show up in a
profile. The current notification mechanism is quite a bit more
heavyweight than any form of fixed ubuf pool.
For TCP this matters less, as multiple sends are not needed and
completions are coalesced, because in order.
Powered by blists - more mailing lists