Date:   Wed, 1 Dec 2021 20:29:30 +0000
From:   Pavel Begunkov <asml.silence@...il.com>
To:     Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc:     io-uring@...r.kernel.org, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        David Ahern <dsahern@...nel.org>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 19:59, Pavel Begunkov wrote:
> On 12/1/21 18:10, Willem de Bruijn wrote:
>>> # performance:
>>>
>>> The worst case for io_uring is (4), still 1.88 times faster than
>>> msg_zerocopy (2), and there are a couple of "easy" optimisations
>>> left out of the patchset. For a 4096-byte payload, zc only slightly
>>> outperforms the non-zc version; the larger the payload, the wider
>>> the gap. I'll get more numbers next time.
>>
>>> Comparing (3) with (4), and (5) with (6), @flush doesn't affect the
>>> results much. Notification posting is not a big problem for now, but
>>> we need to compare performance for when io_uring_tx_zerocopy_callback()
>>> is called from IRQ context, and possibly rework it to use task_work.
>>>
>>> It supports both regular buffers and fixed ones, but there is a bunch
>>> of optimisations exclusive to io_uring's fixed buffers. For comparison,
>>> normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s.
>>>
>>> 1) we pass a bvec, so there are no page table walks (see the
>>>     sketch after this list).
>>> 2) zerocopy_sg_from_iter() is just slow; adding a bvec-optimised
>>>     version that still does page get/put (see 4/12) shaved off 4-5%.
>>> 3) get_page/put_page is avoided entirely in 5/12.
>>> 4) completion events are posted into io_uring's CQ, so no extra
>>>     recvmsg is needed to fetch events.
>>> 5) no poll(2) in the code, thanks to io_uring.
>>> 6) a lot of time is spent in sock_omalloc()/free allocating
>>>     ubuf_info; io_uring caches the structures, reducing that to
>>>     nearly zero overhead.
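For illustration, a rough sketch of what 1) buys. This is hypothetical,
simplified code, not the patchset itself: seed_iter_fixed() is a made-up
helper, while iov_iter_bvec() and struct bio_vec are the real kernel
APIs. With fixed buffers the pages are pinned once at registration, so
the send path can seed its iterator straight from the saved bvec:

#include <linux/uio.h>

/* Fixed-buffer path: the bvec was built (and its pages pinned) at
 * buffer registration time, so seeding the iterator is cheap and no
 * page table walk happens per send.
 */
static void seed_iter_fixed(struct iov_iter *iter,
			    const struct bio_vec *bvec,
			    unsigned long nr_segs, size_t len)
{
	iov_iter_bvec(iter, WRITE, bvec, nr_segs, len);
}

/* The regular-buffer path instead imports a user iovec, and
 * zerocopy_sg_from_iter() then has to pin pages for every send,
 * walking page tables each time.
 */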
>>
>> Nice set of complementary optimizations.
>>
>> We have looked at adding some of those as independent additions to
>> msg_zerocopy before, such as long-term pinned regions. One issue with
>> that is that the pages must remain until the request completes,
>> regardless of whether the calling process is alive. So it cannot rely
>> on a pinned range held by a process only.
>>
>> If feasible, it would be preferable if the optimizations can be added
>> to msg_zerocopy directly, rather than adding a dependency on io_uring
>> to make use of them. But not sure how feasible that is. For some, like
>> 4 and 5, the answer is clearly it isn't.  6, it probably is?

Forgot about 6): io_uring uses the fact that submissions are
done under a per-ring mutex, and completions under a per-ring
spinlock, so there are two lists for them and no extra locking.
The lists are spliced in a batched manner, so it's one spinlock
acquisition per N (e.g. 32) cached ubuf_info allocations.
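Roughly like the following (an illustrative sketch with made-up
names: notif_cache_get/put don't exist as such; list_splice_init()
and friends are the real list.h primitives):

#include <linux/list.h>
#include <linux/spinlock.h>

/* Two-list cache for recycled ubuf_info. Completions append under a
 * spinlock; submission, already serialised by the ring mutex,
 * consumes lock-free and refills by splicing a whole batch at once.
 */
struct notif_cache {
	struct list_head free_list;	/* submission side, ring mutex held */
	struct list_head done_list;	/* completion side */
	spinlock_t done_lock;
};

/* completion path: recycle an entry. A real version would need
 * spin_lock_irqsave() if the callback can run from IRQ context.
 */
static void notif_cache_put(struct notif_cache *c, struct list_head *node)
{
	spin_lock(&c->done_lock);
	list_add(node, &c->done_list);
	spin_unlock(&c->done_lock);
}

/* submission path: callers hold the ring mutex, so free_list needs
 * no locking; the spinlock is taken once per refilled batch, not
 * once per allocation.
 */
static struct list_head *notif_cache_get(struct notif_cache *c)
{
	struct list_head *node;

	if (list_empty(&c->free_list)) {
		spin_lock(&c->done_lock);
		list_splice_init(&c->done_list, &c->free_list);
		spin_unlock(&c->done_lock);
	}
	if (list_empty(&c->free_list))
		return NULL;		/* cache empty: allocate fresh */

	node = c->free_list.next;
	list_del(node);
	return node;
}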

Any similar guarantees for sockets?

[...]

-- 
Pavel Begunkov
