Message-ID: <12393cd2-4b09-4956-fff0-93ef3929ee37@kernel.org>
Date: Sun, 16 Jul 2023 19:41:28 -0700
From: Andy Lutomirski <luto@...nel.org>
To: Mina Almasry <almasrymina@...gle.com>, linux-kernel@...r.kernel.org,
linux-media@...r.kernel.org, dri-devel@...ts.freedesktop.org,
linaro-mm-sig@...ts.linaro.org, netdev@...r.kernel.org,
linux-arch@...r.kernel.org, linux-kselftest@...r.kernel.org
Cc: Sumit Semwal <sumit.semwal@...aro.org>,
Christian König <christian.koenig@....com>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Jesper Dangaard Brouer <hawk@...nel.org>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>, Arnd Bergmann
<arnd@...db.de>, David Ahern <dsahern@...nel.org>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Shuah Khan <shuah@...nel.org>, jgg@...pe.ca
Subject: Re: [RFC PATCH 00/10] Device Memory TCP
On 7/10/23 15:32, Mina Almasry wrote:
> * TL;DR:
>
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.
(I'm writing this as someone who might plausibly use this mechanism, but
I don't think I'm very likely to end up working on the kernel side,
unless I somehow feel extremely inspired to implement it for i40e.)
I looked at these patches and the GVE tree, and I'm trying to wrap my
head around the data path. As I understand it, for RX:
1. The GVE driver notices that the queue is programmed to use devmem,
   and it programs the NIC to copy packet payloads to the devmem
   configured for that queue.
2. The NIC receives the packet and copies the header to kernel memory
   and the payload to dma-buf memory.
3. The kernel tells userspace where in the dma-buf the data is.
4. Userspace does something with the data.
5. Userspace issues DONTNEED to recycle the memory and make it available
   for newly received packets.
Did I get this right?
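As a rough illustration of the five steps above, here is a toy Python model.
It is not kernel or driver code: the class name, the fixed-size slots, and the
token dictionary are all assumptions made up for this sketch, and the real
interface in the patch series may look quite different.

```python
class DevmemRxQueue:
    """Toy model of an rx queue bound to a dma-buf (all names hypothetical)."""

    def __init__(self, slot_size, num_slots):
        self.slot_size = slot_size
        self.dmabuf = bytearray(slot_size * num_slots)  # stands in for device memory
        self.free_slots = list(range(num_slots))
        self.headers = []                               # stands in for kernel memory

    def nic_receive(self, header, payload):
        """Step 2: header goes to kernel memory, payload to the dma-buf."""
        if len(payload) > self.slot_size:
            raise ValueError("payload larger than one slot")
        if not self.free_slots:
            return None  # no free devmem: this model simply drops the packet
        slot = self.free_slots.pop(0)
        offset = slot * self.slot_size
        self.dmabuf[offset:offset + len(payload)] = payload
        self.headers.append(header)
        # Step 3: userspace learns only where in the dma-buf the payload is.
        return {"token": slot, "offset": offset, "len": len(payload)}

    def dontneed(self, token):
        """Step 5: userspace returns the slot so it can hold a new packet."""
        if token in self.free_slots:
            raise ValueError("token already returned")
        self.free_slots.append(token)
```

In this model, a queue with two slots refuses a third packet until userspace
calls `dontneed()` on one of the earlier tokens, which is the recycling
behaviour step 5 describes.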
This seems a bit awkward if there's any chance that packets not intended
for the target device end up in the rxq.
I'm wondering if a more capable, if somewhat higher-latency, model could
work where the NIC stores received packets in its own device memory.
Then userspace (or the kernel or a driver or whatever) could initiate a
separate DMA from the NIC to the final target *after* reading the
headers. Can the hardware support this?
Another way of putting this is: steering received data to a specific
device based on the *receive queue* forces the logic selecting a
destination device to be the same as the logic selecting the queue. RX
steering logic is pretty limited on most hardware (as far as I know --
certainly I've never had much luck doing anything especially intelligent
with RX flow steering, and I've tried on a couple of different brands of
supposedly fancy NICs). But Linux has very nice capabilities to direct
packets, in software, to where they are supposed to go, and it would be
nice if all that logic could just work, scalably, with device memory.
If Linux could examine headers *before* the payload gets DMAed to
wherever it goes, I think this could plausibly work quite nicely. One
could even have an easy-to-use interface in which one directs a *socket*
to a PCIe device. I expect, although I've never looked at the
datasheets, that the kernel could even efficiently make rx decisions
based on data in device memory on upcoming CXL NICs where device memory
could participate in the host cache hierarchy.
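The "examine headers first, DMA second" idea can be sketched as a toy Python
model. Everything here is hypothetical: `StagingNic`, `deliver`, and the
per-socket target table are invented for illustration, and whether any real
NIC can hold packets and DMA them on request is exactly the open question
raised above.

```python
class StagingNic:
    """Toy NIC that keeps received packets in its own memory until told
    where to DMA the payload (hypothetical hardware behaviour)."""

    def __init__(self):
        self._staged = {}   # packet id -> (header, payload), i.e. NIC-local memory
        self._next_id = 0

    def receive(self, header, payload):
        pkt_id = self._next_id
        self._next_id += 1
        self._staged[pkt_id] = (header, payload)
        return pkt_id

    def read_header(self, pkt_id):
        return self._staged[pkt_id][0]

    def dma_payload(self, pkt_id, target):
        # The payload leaves NIC memory only now, after software has decided.
        _, payload = self._staged.pop(pkt_id)
        target.extend(payload)


def deliver(nic, pkt_id, socket_targets, host_memory):
    """Choose the destination from the headers, then start the DMA."""
    hdr = nic.read_header(pkt_id)
    key = (hdr["dst_ip"], hdr["dst_port"])
    # Sockets that were directed at a device get device memory;
    # everything else falls back to ordinary host memory.
    target = socket_targets.get(key, host_memory)
    nic.dma_payload(pkt_id, target)
    return target
```

The point of the sketch is that the destination is chosen per packet by
software lookup, not by which receive queue the packet landed in, so stray
packets on the same queue simply go to host memory.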
My real ulterior motive is that I think it would be great to use an
ability like this for DPDK-like uses. Wouldn't it be nifty if I could
open a normal TCP socket, then, after it's open, ask the kernel to
kindly DMA the results directly to my application memory (via udmabuf,
perhaps)? Or have a whole VLAN or macvlan get directed to a userspace
queue, etc?
It also seems a bit odd to me that the binding from rxq to dma-buf is
established by programming the dma-buf. This makes the security model
(and the mental model) awkward -- this binding is a setting on the
*queue*, not the dma-buf, and in a containerized or privilege-separated
system, a process could have enough privilege to make a dma-buf
somewhere but not have any privileges on the NIC. (And may not even
have the NIC present in its network namespace!)
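To illustrate the alternative, here is a minimal Python sketch in which the
binding is an operation on the queue and the permission check concerns the
NIC. The `Process`, `Nic`, and `bind_queue` names, and the single `net_admin`
flag, are assumptions for this sketch only; they do not describe the proposed
patches or any existing kernel interface.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    netns: str
    net_admin: bool   # stand-in for whatever privilege the NIC would require


@dataclass
class Nic:
    netns: str
    queue_bindings: dict = field(default_factory=dict)  # queue index -> dma-buf name


def bind_queue(proc, nic, queue, dmabuf_name):
    """Bind a dma-buf to an rx queue, checking privileges on the NIC side."""
    if proc.netns != nic.netns:
        raise PermissionError(f"{proc.name}: NIC is not in this network namespace")
    if not proc.net_admin:
        raise PermissionError(f"{proc.name}: no privileges on the NIC")
    nic.queue_bindings[queue] = dmabuf_name
```

With this shape, a containerized process that can create a dma-buf but cannot
see the NIC gets a `PermissionError` from `bind_queue()`, while an
administrator in the NIC's namespace can perform the same binding.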
--Andy