Date: Mon, 11 Jul 2022 13:56:03 +0100
From: Pavel Begunkov <asml.silence@...il.com>
To: David Ahern <dsahern@...nel.org>, io-uring@...r.kernel.org, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, Jonathan Lemon <jonathan.lemon@...il.com>, Willem de Bruijn <willemb@...gle.com>, Jens Axboe <axboe@...nel.dk>, kernel-team@...com
Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/8/22 15:26, Pavel Begunkov wrote:
> On 7/8/22 05:10, David Ahern wrote:
>> On 7/7/22 5:49 AM, Pavel Begunkov wrote:
>>> NOTE: Not to be picked directly. After getting the necessary acks, I'll be
>>> working out merging with Jakub and Jens.
>>>
>>> The patchset implements io_uring zerocopy send. It works with both
>>> registered and normal buffers; mixing is allowed but not recommended. Apart
>>> from the usual request completions, just as with MSG_ZEROCOPY, io_uring
>>> separately notifies the userspace when buffers are freed and can be reused
>>> (see API design below), with the notification delivered into io_uring's
>>> Completion Queue. Those "buffer-free" notifications are not necessarily per
>>> request, but the userspace has control over it and should explicitly attach
>>> a number of requests to a single notification. The series also adds some
>>> internal optimisations when used with registered buffers, like removing
>>> page referencing.
>>>
>>> From the kernel networking perspective there are two main changes. The
>>> first one is passing ubuf_info into the network layer from io_uring (inside
>>> of an in-kernel struct msghdr). This allows extra optimisations, e.g.
>>> ubuf_info caching on the io_uring side, but also helps to avoid
>>> cross-referencing and synchronisation problems. The second part is an
>>> optional optimisation removing page referencing for requests with
>>> registered buffers.
>>>
>>> Benchmarking was done with an optimised version of the selftest (see [1]),
>>> which sends a bunch of requests, waits for completions and repeats. The
>>> "+ flush" column posts one additional "buffer-free" notification per
>>> request, and plain "zc" doesn't post buffer notifications at all.
>>>
>>> NIC (requests / second):
>>> IO size | non-zc  | zc             | zc + flush
>>> 4000    | 495134  | 606420 (+22%)  | 558971 (+12%)
>>> 1500    | 551808  | 577116 (+4.5%) | 565803 (+2.5%)
>>> 1000    | 584677  | 592088 (+1.2%) | 560885 (-4%)
>>> 600     | 596292  | 598550 (+0.4%) | 555366 (-6.7%)
>>>
>>> dummy (requests / second):
>>> IO size | non-zc  | zc              | zc + flush
>>> 8000    | 1299916 | 2396600 (+84%)  | 2224219 (+71%)
>>> 4000    | 1869230 | 2344146 (+25%)  | 2170069 (+16%)
>>> 1200    | 2071617 | 2361960 (+14%)  | 2203052 (+6%)
>>> 600     | 2106794 | 2381527 (+13%)  | 2195295 (+4%)
>>>
>>> Previously it also brought a massive performance speedup compared to the
>>> msg_zerocopy tool (see [3]), which is probably not super interesting.
>>>
>>
>> Can you add a comment that the above results are for UDP?
>
> Oh, right, forgot to add it.
>
>
>> You dropped comments about TCP testing; any progress there? If not, can
>> you relay any issues you are hitting?
>
> Not really a problem, but for me it's bottlenecked at NIC bandwidth
> (~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
> It was actually benchmarked by my colleague quite a while ago, but I can't
> find the numbers. I probably need to at least add localhost numbers or grab
> a better server.
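For background, the plain MSG_ZEROCOPY flow that the cover letter compares
against looks roughly like the sketch below from the userspace side. It is a
minimal sketch only: error handling is trimmed and the usual poll()-driven
reaping of the error queue is omitted.

/* Minimal sketch of the plain MSG_ZEROCOPY flow, for comparison with the
 * io_uring notifications; error handling trimmed, socket assumed connected. */
#include <linux/errqueue.h>
#include <sys/socket.h>

static void send_zerocopy_once(int fd, const void *buf, size_t len)
{
	int one = 1;

	/* Opt the socket into zerocopy once. */
	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

	/* The kernel pins these pages; buf must not be reused yet. */
	send(fd, buf, len, MSG_ZEROCOPY);

	/* The "buffer free" notification arrives on the socket error queue. */
	char control[100];
	struct msghdr msg = {
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return;

	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
	if (!cm)
		return;
	struct sock_extended_err *serr = (void *)CMSG_DATA(cm);
	if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
		/* serr->ee_info..ee_data is the range of zerocopy sends
		 * whose buffers may now be reused. */
	}
}

The series delivers the equivalent "buffer free" event as a CQE rather than an
error-queue message, and lets the userspace decide how many requests share a
single notification.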
Testing localhost TCP with a hack (see below). It doesn't include the
refcounting optimisations I was testing UDP with; those will be sent
afterwards. Numbers are in MB/s:

IO size | non-zc | zc
1200    | 4174   | 4148
4096    | 7597   | 11228

Because it's localhost, we also spend cycles here on the recv side. With a
real NIC and 1200-byte payloads, zc is ~5-10% worse than non-zc; maybe the
omitted optimisations will help somewhat. I don't consider it a blocker, but
it would be interesting to poke into later. One thing helping non-zc is that
it squeezes a number of requests into a single page, whereas zerocopy adds a
new frag for every request.

Can't say anything new for larger payloads: I'm still NIC-bound, but looking
at CPU utilisation zc doesn't drain as many cycles as non-zc.

Also, I don't remember if I mentioned it before, but another catch is that
with TCP it expects users not to flush notifications too often, because a
flush forces it to allocate a new skb and lose a good chunk of the benefit
of using TCP.

The hack below makes skb_orphan_frags_rx() fall back to skb_orphan_frags(),
so zerocopy frags are not unconditionally copied when an skb loops back to
the rx path on localhost.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1111adefd906..c4b781b2c3b1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3218,9 +3218,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 /* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
 static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	if (likely(!skb_zcopy(skb)))
-		return 0;
-	return skb_copy_ubufs(skb, gfp_mask);
+	return skb_orphan_frags(skb, gfp_mask);
 }

--
Pavel Begunkov
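For the io_uring side of the comparison, here is a hedged sketch of driving a
single zerocopy send from userspace. It uses the liburing helpers as they
later landed in mainline (io_uring_prep_send_zc() and the IORING_CQE_F_MORE /
IORING_CQE_F_NOTIF completion flags), so the names may not match this v4
revision, which still exposes explicit notification slots.

/* Hedged sketch using the liburing interface as merged in mainline; the v4
 * series under review uses explicit notification slots instead, so treat
 * the names below as describing the final API, not this posting. */
#include <liburing.h>

static int send_one_zc(struct io_uring *ring, int sockfd,
		       const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;

	if (!sqe)
		return -1;
	io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
	io_uring_submit(ring);

	/*
	 * Expect two CQEs: the send result (IORING_CQE_F_MORE set, i.e. a
	 * notification will follow) and later the "buffer free" notification
	 * (IORING_CQE_F_NOTIF set). The buffer may only be reused after the
	 * latter arrives.
	 */
	for (int i = 0; i < 2; i++) {
		if (io_uring_wait_cqe(ring, &cqe))
			return -1;
		if (cqe->flags & IORING_CQE_F_NOTIF) {
			/* buf is safe to reuse from this point on */
		}
		io_uring_cqe_seen(ring, cqe);
	}
	return 0;
}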