Message-ID: <0f54508f-e819-e367-84c2-7aa0d7767097@gmail.com>
Date: Mon, 11 Jul 2022 13:56:03 +0100
From: Pavel Begunkov <asml.silence@...il.com>
To: David Ahern <dsahern@...nel.org>, io-uring@...r.kernel.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Cc: "David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Jonathan Lemon <jonathan.lemon@...il.com>,
Willem de Bruijn <willemb@...gle.com>,
Jens Axboe <axboe@...nel.dk>, kernel-team@...com
Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send
On 7/8/22 15:26, Pavel Begunkov wrote:
> On 7/8/22 05:10, David Ahern wrote:
>> On 7/7/22 5:49 AM, Pavel Begunkov wrote:
>>> NOTE: Not to be picked directly. After getting the necessary acks, I'll
>>> work out merging with Jakub and Jens.
>>>
>>> The patchset implements io_uring zerocopy send. It works with both registered
>>> and normal buffers; mixing is allowed but not recommended. Apart from the usual
>>> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
>>> the userspace when buffers are freed and can be reused (see API design below);
>>> the notification is delivered into io_uring's Completion Queue. Those
>>> "buffer-free" notifications are not necessarily per request, but the userspace
>>> has control over it and should explicitly attach a number of requests to a
>>> single notification. The series also adds some internal optimisations for
>>> registered buffers, such as removing page referencing.
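To make the flow concrete, here is a minimal userspace sketch. It assumes
liburing's io_uring_prep_send_zc() and the IORING_CQE_F_MORE/IORING_CQE_F_NOTIF
flags as the interface later landed upstream, not the notification-slot/flush
API described in this v4 posting, so treat it as the general shape rather than
the exact ABI discussed here:

/* Hypothetical usage sketch (not from the patchset): zerocopy send via
 * liburing. Each request posts a normal completion CQE and, if that CQE
 * carries IORING_CQE_F_MORE, a second "buffer-free" CQE flagged
 * IORING_CQE_F_NOTIF; the buffer may only be reused after the latter. */
#include <liburing.h>
#include <stdbool.h>
#include <stddef.h>

static int send_zc_once(struct io_uring *ring, int sockfd,
			const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	bool more = true;

	if (!sqe)
		return -1;
	io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
	io_uring_submit(ring);

	while (more) {
		if (io_uring_wait_cqe(ring, &cqe))
			return -1;
		if (cqe->flags & IORING_CQE_F_NOTIF)
			more = false;		/* buffer is reusable now */
		else if (!(cqe->flags & IORING_CQE_F_MORE))
			more = false;		/* no notification pending */
		io_uring_cqe_seen(ring, cqe);
	}
	return 0;
}

In the API of this series the "buffer-free" event instead comes from an
explicitly flushed notification that several requests can be attached to, but
the contract is the same: don't reuse the buffer until the notification arrives.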
>>>
>>> From the kernel networking perspective there are two main changes. The first
>>> one is passing a ubuf_info into the network layer from io_uring (inside the
>>> in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
>>> caching on the io_uring side, but also helps to avoid cross-referencing
>>> and synchronisation problems. The second part is an optional optimisation
>>> removing page referencing for requests with registered buffers.
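Roughly, the first change boils down to the in-kernel sender supplying its own
ubuf_info through the new msghdr field. A simplified, illustrative sketch
(ubuf_info setup, locking and error handling omitted, not taken verbatim from
the patches):

/* Illustrative only: an in-kernel sender (io_uring in this series) passing
 * its own ubuf_info down via the msg_ubuf field added to struct msghdr, so
 * the stack uses it instead of allocating a MSG_ZEROCOPY ubuf per skb. */
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>
#include <net/sock.h>

static int zc_sendmsg_sketch(struct socket *sock, void __user *buf,
			     size_t len, struct ubuf_info *uarg)
{
	struct msghdr msg = {};
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	iov_iter_init(&msg.msg_iter, WRITE, &iov, 1, len);
	msg.msg_flags = MSG_DONTWAIT | MSG_ZEROCOPY;
	msg.msg_ubuf = uarg;	/* caller-owned, completion via its callback */

	return sock_sendmsg(sock, &msg);
}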
>>>
>>> Benchmarks use an optimised version of the selftest (see [1]), which sends
>>> a batch of requests, waits for completions and repeats. The "+ flush" column
>>> posts one additional "buffer-free" notification per request, while plain "zc"
>>> doesn't post buffer notifications at all.
>>>
>>> NIC (requests / second):
>>> IO size | non-zc | zc | zc + flush
>>> 4000 | 495134 | 606420 (+22%) | 558971 (+12%)
>>> 1500 | 551808 | 577116 (+4.5%) | 565803 (+2.5%)
>>> 1000 | 584677 | 592088 (+1.2%) | 560885 (-4%)
>>> 600 | 596292 | 598550 (+0.4%) | 555366 (-6.7%)
>>>
>>> dummy (requests / second):
>>> IO size | non-zc | zc | zc + flush
>>> 8000 | 1299916 | 2396600 (+84%) | 2224219 (+71%)
>>> 4000 | 1869230 | 2344146 (+25%) | 2170069 (+16%)
>>> 1200 | 2071617 | 2361960 (+14%) | 2203052 (+6%)
>>> 600 | 2106794 | 2381527 (+13%) | 2195295 (+4%)
>>>
>>> Previously it also brought a massive performance speedup compared to the
>>> msg_zerocopy tool (see [3]), which is probably not super interesting.
>>>
>>
>> Can you add a comment that the above results are for UDP?
>
> Oh, right, forgot to add it
>
>
>> You dropped comments about TCP testing; any progress there? If not, can
>> you relay any issues you are hitting?
>
> Not really a problem, but for me it's bottlenecked at NIC bandwidth
> (~3GB/s) for both zc and non-zc and doesn't come close to saturating a CPU.
> It was actually benchmarked by a colleague quite a while ago, but I can't
> find the numbers. I probably need to at least add localhost numbers or grab
> a better server.
Tested localhost TCP with a hack (see below). It doesn't include the
refcounting optimisations I was testing UDP with, which will be sent
afterwards. Numbers are in MB/s:
IO size | non-zc | zc
1200 | 4174 | 4148
4096 | 7597 | 11228
Because it's localhost, we also spend cycles here on the recv side. With
a real NIC and 1200-byte payloads, zc is ~5-10% worse than non-zc; maybe
the omitted optimisations will help somewhat. I don't consider it a
blocker, but it would be interesting to poke into later. One thing helping
non-zc is that it squeezes a number of requests into a single page,
whereas zerocopy adds a new frag for every request.
Can't say anything new for larger payloads: I'm still NIC-bound, but
looking at CPU utilisation zc doesn't burn as many cycles as non-zc.
Also, I don't remember if I mentioned it before, but another catch is
that with TCP it expects users not to flush notifications too often,
because each flush forces it to allocate a new skb and lose a good chunk
of the benefits of using TCP.
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1111adefd906..c4b781b2c3b1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3218,9 +3218,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
/* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
{
- if (likely(!skb_zcopy(skb)))
- return 0;
- return skb_copy_ubufs(skb, gfp_mask);
+ return skb_orphan_frags(skb, gfp_mask);
}
--
Pavel Begunkov