Message-ID: <c25f6c3f-e576-4c56-ba4b-328dfecbfb35@redhat.com>
Date: Fri, 17 Jan 2025 17:05:15 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: Pavel Begunkov <asml.silence@...il.com>, Jakub Kicinski <kuba@...nel.org>
Cc: Jens Axboe <axboe@...nel.dk>, "David S. Miller" <davem@...emloft.net>,
 Eric Dumazet <edumazet@...gle.com>, Jesper Dangaard Brouer
 <hawk@...nel.org>, David Ahern <dsahern@...nel.org>,
 Mina Almasry <almasrymina@...gle.com>,
 Stanislav Fomichev <stfomichev@...il.com>, Joe Damato <jdamato@...tly.com>,
 Pedro Tammela <pctammela@...atatu.com>, David Wei <dw@...idwei.uk>,
 io-uring@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v11 00/21] io_uring zero copy rx

On 1/17/25 3:42 PM, Pavel Begunkov wrote:
> On 1/17/25 14:28, Paolo Abeni wrote:
>> On 1/17/25 12:16 AM, David Wei wrote:
>>> This patchset adds support for zero copy rx into userspace pages using
>>> io_uring, eliminating a kernel to user copy.
>>>
>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>> hand out user pages instead of kernel pages. Any data that ends up
>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>> data out of a socket instead becomes a _notification_ mechanism, where
>>> the kernel tells userspace where the data is. The overall approach is
>>> similar to the devmem TCP proposal.
>>>
>>> This relies on hw header/data split, flow steering and RSS to ensure
>>> packet headers remain in kernel memory and only desired flows hit a hw
>>> rx queue configured for zero copy. Configuring this is outside of the
>>> scope of this patchset.
>>>
>>> We share netdev core infra with devmem TCP. The main difference is that
>>> io_uring is used for the uAPI and the lifetime of all objects is bound
>>> to an io_uring instance. Data is 'read' using a new io_uring request
>>> type. When done, data is returned via a new shared refill queue. A zero
>>> copy page pool refills a hw rx queue from this refill queue directly. Of
>>> course, the lifetime of these data buffers is managed by io_uring
>>> rather than the networking stack, with different refcounting rules.
>>>
>>> This patchset is the first step adding basic zero copy support. We will
>>> extend this iteratively with new features e.g. dynamically allocated
>>> zero copy areas, THP support, dmabuf support, improved copy fallback,
>>> general optimisations and more.
>>>
>>> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
>>> aren't included since Taehee Yoo has already sent a more comprehensive
>>> patchset adding support in [1]. Google gve should already support this,
>>> and Mellanox mlx5 support is WIP pending driver changes.
>>>
>>> ===========
>>> Performance
>>> ===========
>>>
>>> Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.
>>>
>>> Test setup:
>>> * AMD EPYC 9454
>>> * Broadcom BCM957508 200G
>>> * Kernel v6.11 base [2]
>>> * liburing fork [3]
>>> * kperf fork [4]
>>> * 4K MTU
>>> * Single TCP flow
>>>
>>> With application thread + net rx softirq pinned to _different_ cores:
>>>
>>> +-------------------------------+
>>> | epoll     | io_uring          |
>>> |-----------|-------------------|
>>> | 82.2 Gbps | 116.2 Gbps (+41%) |
>>> +-------------------------------+
>>>
>>> Pinned to _same_ core:
>>>
>>> +-------------------------------+
>>> | epoll     | io_uring          |
>>> |-----------|-------------------|
>>> | 62.6 Gbps | 80.9 Gbps (+29%)  |
>>> +-------------------------------+
>>>
>>> =====
>>> Links
>>> =====
>>>
>>> Broadcom bnxt support:
>>> [1]: https://lore.kernel.org/netdev/20241003160620.1521626-8-ap420073@gmail.com/
>>>
>>> Linux kernel branch:
>>> [2]: https://github.com/spikeh/linux.git zcrx/v9
>>>
>>> liburing for testing:
>>> [3]: https://github.com/isilence/liburing.git zcrx/next
>>>
>>> kperf for testing:
>>> [4]: https://git.kernel.dk/kperf.git
>>
>> We are getting very close to the merge window. In order to get this
>> series merged before that deadline, the point raised by Jakub on this
>> version must be resolved, the next iteration should land on the ML
>> before the end of the current working day, and the series must apply
>> cleanly to net-next, so that it can be processed by our CI.
> 
> Sounds good, thanks Paolo.
> 
> Since the merging is not trivial, I'll send a PR for the net/
> patches instead of reposting the entire thing, if that sounds right
> to you. The rest will be handled on the io_uring side.

I agree it is the more straightforward path. @Jakub: do you see any
problem with the above?

/P
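
The quoted cover letter says that buffers are handed back through a shared
refill queue which the zero copy page pool then drains to refill the hw rx
queue. As a rough illustration of that producer/consumer relationship only,
here is a minimal single-producer/single-consumer ring sketch in C. Everything
in it is an assumption made for illustration: the names toy_rqe,
toy_refill_ring, toy_rq_push, toy_rq_pop and toy_demo_roundtrip, the entry
layout, and the ring size are hypothetical and are not the uAPI or the kernel
code from the patchset.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RQ_ENTRIES 8u /* hypothetical ring size; must be a power of two */

/* Hypothetical refill entry: offset and length of a buffer being returned. */
struct toy_rqe {
    uint64_t off;
    uint32_t len;
};

/* Toy shared ring: the producer owns tail, the consumer owns head. */
struct toy_refill_ring {
    _Atomic uint32_t head;
    _Atomic uint32_t tail;
    struct toy_rqe rqes[RQ_ENTRIES];
};

/* Producer side (stands in for userspace returning a buffer). */
static bool toy_rq_push(struct toy_refill_ring *rq, uint64_t off, uint32_t len)
{
    uint32_t tail = atomic_load_explicit(&rq->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&rq->head, memory_order_acquire);

    if (tail - head == RQ_ENTRIES)
        return false; /* ring is full */

    rq->rqes[tail & (RQ_ENTRIES - 1)] = (struct toy_rqe){ .off = off, .len = len };
    /* Publish the entry only after it has been written. */
    atomic_store_explicit(&rq->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side (stands in for the page pool taking a buffer back). */
static bool toy_rq_pop(struct toy_refill_ring *rq, struct toy_rqe *out)
{
    uint32_t head = atomic_load_explicit(&rq->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&rq->tail, memory_order_acquire);

    if (head == tail)
        return false; /* ring is empty */

    *out = rq->rqes[head & (RQ_ENTRIES - 1)];
    /* Release the slot only after the entry has been copied out. */
    atomic_store_explicit(&rq->head, head + 1, memory_order_release);
    return true;
}

/* Push three 4 KiB buffers, pop them back, and check FIFO order. */
static int toy_demo_roundtrip(void)
{
    struct toy_refill_ring rq = {0};
    struct toy_rqe e;
    int popped = 0;

    for (uint64_t i = 0; i < 3; i++)
        assert(toy_rq_push(&rq, i * 4096, 4096));

    while (toy_rq_pop(&rq, &e)) {
        assert(e.off == (uint64_t)popped * 4096);
        popped++;
    }
    return popped;
}
```

Calling toy_demo_roundtrip() from a test or a main() exercises the round trip.
The head/tail split is the usual reason such a ring needs no lock: each side
writes only its own index and reads the other's.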