Message-ID: <ce9caef4-0d95-4e81-bdb8-536236377f81@gmail.com>
Date: Fri, 17 Jan 2025 14:42:30 +0000
From: Pavel Begunkov <asml.silence@...il.com>
To: Paolo Abeni <pabeni@...hat.com>, David Wei <dw@...idwei.uk>,
io-uring@...r.kernel.org, netdev@...r.kernel.org
Cc: Jens Axboe <axboe@...nel.dk>, Jakub Kicinski <kuba@...nel.org>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jesper Dangaard Brouer <hawk@...nel.org>, David Ahern <dsahern@...nel.org>,
Mina Almasry <almasrymina@...gle.com>,
Stanislav Fomichev <stfomichev@...il.com>, Joe Damato <jdamato@...tly.com>,
Pedro Tammela <pctammela@...atatu.com>
Subject: Re: [PATCH net-next v11 00/21] io_uring zero copy rx
On 1/17/25 14:28, Paolo Abeni wrote:
> On 1/17/25 12:16 AM, David Wei wrote:
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be dma'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
>>
>> We share netdev core infra with devmem TCP. The main difference is that
>> io_uring is used for the uAPI and the lifetime of all objects is bound
>> to an io_uring instance. Data is 'read' using a new io_uring request
>> type. When done, data is returned via a new shared refill queue. A zero
>> copy page pool refills a hw rx queue from this refill queue directly. Of
>> course, the lifetime of these data buffers is managed by io_uring
>> rather than the networking stack, with different refcounting rules.
>>
>> This patchset is the first step adding basic zero copy support. We will
>> extend this iteratively with new features e.g. dynamically allocated
>> zero copy areas, THP support, dmabuf support, improved copy fallback,
>> general optimisations and more.
>>
>> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
>> aren't included since Taehee Yoo has already sent a more comprehensive
>> patchset adding support in [1]. Google gve should already support this,
>> and Mellanox mlx5 support is WIP pending driver changes.
>>
>> ===========
>> Performance
>> ===========
>>
>> Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.
>>
>> Test setup:
>> * AMD EPYC 9454
>> * Broadcom BCM957508 200G
>> * Kernel v6.11 base [2]
>> * liburing fork [3]
>> * kperf fork [4]
>> * 4K MTU
>> * Single TCP flow
>>
>> With application thread + net rx softirq pinned to _different_ cores:
>>
>> +-------------------------------+
>> |   epoll   |     io_uring      |
>> |-----------|-------------------|
>> | 82.2 Gbps | 116.2 Gbps (+41%) |
>> +-------------------------------+
>>
>> Pinned to _same_ core:
>>
>> +-------------------------------+
>> |   epoll   |     io_uring      |
>> |-----------|-------------------|
>> | 62.6 Gbps | 80.9 Gbps (+29%) |
>> +-------------------------------+
>>
>> =====
>> Links
>> =====
>>
>> Broadcom bnxt support:
>> [1]: https://lore.kernel.org/netdev/20241003160620.1521626-8-ap420073@gmail.com/
>>
>> Linux kernel branch:
>> [2]: https://github.com/spikeh/linux.git zcrx/v9
>>
>> liburing for testing:
>> [3]: https://github.com/isilence/liburing.git zcrx/next
>>
>> kperf for testing:
>> [4]: https://git.kernel.dk/kperf.git
>
> We are getting very close to the merge window. In order to get this
> series merged before that deadline, the point raised by Jakub on this
> version must be resolved, the next iteration should land on the ML
> before the end of the current working day, and the series must apply
> cleanly to net-next, so that it can be processed by our CI.

Sounds good, thanks Paolo.

Since the merging is not trivial, I'll send a PR for the net/
patches instead of reposting the entire thing, if that sounds right
to you. The rest will be handled on the io_uring side.
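
For anyone reading this thread from the archive, a rough sketch of the
buffer life cycle the cover letter above describes might look like the
following: a completion tells userspace where in the pre-mapped area the
payload was DMA'd, and once the application is done with it the buffer is
handed back through the shared refill queue that the zero copy page pool
drains. The struct layouts, field names and the consume() helper below
are made up for illustration and are not the real uAPI; the actual
interface is in the liburing fork linked as [3].

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a zero copy rx completion: the kernel only
 * tells us where the payload already is, it does not copy it out. */
struct zcrx_cqe {
	uint64_t off;		/* offset of the data in the mapped area */
	uint32_t len;		/* number of bytes received */
};

/* Illustrative stand-in for one entry of the shared refill queue. */
struct zcrx_rqe {
	uint64_t off;		/* buffer being returned to the page pool */
	uint32_t len;
};

/* Illustrative userspace view of the mmap'ed refill ring. */
struct zcrx_refill_ring {
	_Atomic uint32_t *ktail;	/* tail index shared with the kernel */
	struct zcrx_rqe *rqes;		/* entry array */
	uint32_t ring_mask;
	uint32_t tail;			/* userspace shadow of the tail */
};

/* Application-defined payload processing, trivial here. */
static uint64_t bytes_seen;

static void consume(const unsigned char *data, size_t len)
{
	(void)data;
	bytes_seen += len;
}

/* Handle one completion, then recycle the buffer via the refill queue so
 * the zero copy page pool can put it back into the hw rx queue. */
static void handle_zc_completion(const struct zcrx_cqe *cqe,
				 unsigned char *area_base,
				 struct zcrx_refill_ring *rq)
{
	/* The payload was DMA'd straight into the user-mapped area. */
	consume(area_base + cqe->off, cqe->len);

	struct zcrx_rqe *rqe = &rq->rqes[rq->tail & rq->ring_mask];

	rqe->off = cqe->off;
	rqe->len = cqe->len;
	rq->tail++;

	/* Publish the new tail; release ordering pairs with the kernel's
	 * acquire read of the refill entries. */
	atomic_store_explicit(rq->ktail, rq->tail, memory_order_release);
}

The series of course also covers registering the area and the refill ring
and issuing the new request type; the sketch only shows the
notification-plus-recycle part that replaces the kernel-to-user copy.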
--
Pavel Begunkov