netdev - Re: [PATCH v1 00/15] io

Message-ID: <CAHS8izOv9cB60oUbxz_52BMGi7T4_u9rzTOCb23LGvZOX0QXqg@mail.gmail.com>
Date: Wed, 9 Oct 2024 09:55:06 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: David Wei <dw@...idwei.uk>
Cc: io-uring@...r.kernel.org, netdev@...r.kernel.org, 
	Jens Axboe <axboe@...nel.dk>, Pavel Begunkov <asml.silence@...il.com>, 
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, David Ahern <dsahern@...nel.org>
Subject: Re: [PATCH v1 00/15] io_uring zero copy rx

On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@...idwei.uk> wrote:
>
> This patchset adds support for zero copy rx into userspace pages using
> io_uring, eliminating a kernel to user copy.
>
> We configure a page pool that a driver uses to fill a hw rx queue to
> hand out user pages instead of kernel pages. Any data that ends up
> hitting this hw rx queue will thus be dma'd into userspace memory
> directly, without needing to be bounced through kernel memory. 'Reading'
> data out of a socket instead becomes a _notification_ mechanism, where
> the kernel tells userspace where the data is. The overall approach is
> similar to the devmem TCP proposal.
>
> This relies on hw header/data split, flow steering and RSS to ensure
> packet headers remain in kernel memory and only desired flows hit a hw
> rx queue configured for zero copy. Configuring this is outside of the
> scope of this patchset.
>
> We share netdev core infra with devmem TCP. The main difference is that
> io_uring is used for the uAPI and the lifetime of all objects are bound
> to an io_uring instance.

I've been thinking about this a bit, and I hope this feedback isn't
too late, but I think your work may be useful for users not using
io_uring, i.e. zero copy to host memory that does not depend on
page-aligned MSS sizing. In other words, AF_XDP zerocopy, but using the
TCP stack.

If we refactor things a bit, we should be able to have the memory tied
to the RX queue, similar to what AF_XDP does, and then we should be
able to zero copy to that memory via both regular sockets and
io_uring. This will be useful for us and for other applications that
would like to do ZC similar to what you're doing here, but not
necessarily through io_uring.

> Data is 'read' using a new io_uring request
> type. When done, data is returned via a new shared refill queue. A zero
> copy page pool refills a hw rx queue from this refill queue directly. Of
> course, the lifetime of these data buffers are managed by io_uring
> rather than the networking stack, with different refcounting rules.
>
> This patchset is the first step adding basic zero copy support. We will
> extend this iteratively with new features e.g. dynamically allocated
> zero copy areas, THP support, dmabuf support, improved copy fallback,
> general optimisations and more.
>
> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
> aren't included since Taehee Yoo has already sent a more comprehensive
> patchset adding support in [1]. Google gve should already support this,

This is an aside, but GVE supports this via the out-of-tree patches
I've been carrying on GitHub. Upstream, we're working on adding the
prerequisite page_pool support.

> and Mellanox mlx5 support is WIP pending driver changes.
>
> ===========
> Performance
> ===========
>
> Test setup:
> * AMD EPYC 9454
> * Broadcom BCM957508 200G
> * Kernel v6.11 base [2]
> * liburing fork [3]
> * kperf fork [4]
> * 4K MTU
> * Single TCP flow
>
> With application thread + net rx softirq pinned to _different_ cores:
>
> epoll
> 82.2 Gbps
>
> io_uring
> 116.2 Gbps (+41%)
>
> Pinned to _same_ core:
>
> epoll
> 62.6 Gbps
>
> io_uring
> 80.9 Gbps (+29%)
>
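
As a quick sanity check, the quoted percentages follow directly from
the throughput figures above. A minimal sketch in Python (the helper
name speedup_percent is made up for illustration and is not part of
kperf or the patchset):

```python
def speedup_percent(baseline_gbps, new_gbps):
    """Relative throughput gain of new over baseline, in percent."""
    return (new_gbps / baseline_gbps - 1.0) * 100.0

# Throughput figures quoted above, in Gbps: (epoll, io_uring)
results = {
    "different cores": (82.2, 116.2),
    "same core": (62.6, 80.9),
}

for setup, (epoll, io_uring) in results.items():
    # Prints the gain rounded to a whole percent, with an explicit sign
    print(f"{setup}: {speedup_percent(epoll, io_uring):+.0f}%")
```

This only reproduces the arithmetic; it says nothing about which
receive path each configuration was actually exercising.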

Do the 'epoll' and 'io_uring' results here use TCP RX zerocopy [1]
and io_uring zerocopy, respectively?

If not, I would like to see a comparison between TCP RX zerocopy and
this new io_uring zerocopy. At Google, for example, we use TCP RX
zerocopy, and I would like to see perf numbers that could motivate us
to move to this new approach.

[1] https://lwn.net/Articles/752046/


-- 
Thanks,
Mina