Message-ID: <770012.1748618092@warthog.procyon.org.uk>
Date: Fri, 30 May 2025 16:14:52 +0100
From: David Howells <dhowells@...hat.com>
To: Mina Almasry <almasrymina@...gle.com>
cc: dhowells@...hat.com, willy@...radead.org, hch@...radead.org,
Jakub Kicinski <kuba@...nel.org>, Eric Dumazet <edumazet@...gle.com>,
netdev@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Device mem changes vs pinning/zerocopy changes
Hi Mina,
I've seen that your transmission-side TCP devicemem stuff has just gone in,
and it conflicts somewhat with what I'm trying to do. I think you're working
on the problem bottom-up and I'm working on it top-down, so would you be
willing to collaborate on it?
So, to summarise what we need to change (you may already know all of this):
(*) The refcount in struct page is going to go away. The sk_buff fragment
wrangling code, however, occasionally decides to override the zerocopy
mode and grab refs on the pages pointed to by those fragments. sk_buffs
*really* want those page refs - and it does simplify memory handling.
But.
Anyway, we need to stop taking refs where possible. A fragment may in
future point to a sequence of pages and we would only be getting a ref on
one of them.
(*) Further, the page struct is intended to be slimmed down to a single typed
pointer if possible, so all the metadata in the net_iov struct will have
to be separately allocated.
(*) Currently, when performing MSG_ZEROCOPY, we just take refs on the user
pages specified by the iterator, but we need to stop doing that. We need
to call GUP to take a "pin" instead (and must not take any refs). The
pages we get access to may be folio-type, anon-type, or some sort of
device type.
(*) It would be good to do a batch lookup of user buffers to cut down on the
number of page table trawls we do - but, on the other hand, that might
generate more page faults upfront.
(*) Splice and vmsplice. If only I could uninvent them... Anyway, they give
us buffers from a pipe - but those buffers come with destructors, so we
should not take refs on the pages they appear to carry; we should rely on
the destructor instead.
(*) The intention is to change struct bio_vec to be just physical address and
length, with no page pointer. You'd then use, say, kmap_local_phys() or
kmap_local_bvec() to access the contents from the CPU (a rough sketch of
what such an accessor might look like follows this list). We could then
revert the fragment pointers to being bio_vecs.
(*) Kernel services, such as network filesystems, can't pass kmalloc()'d data
to sendmsg(MSG_SPLICE_PAGES) because slabs don't have refcounts and, in
any case, the object lifetime is not managed by refcount. However, if we
had a destructor, this restriction could go away.
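
To illustrate the bio_vec point above, here's a minimal sketch of what a
kmap_local_bvec() accessor might look like, assuming the new bio_vec carries
its physical address in a hypothetical bv_phys field - none of this exists
upstream yet, only kmap_local_pfn() does:

static inline void *kmap_local_bvec(const struct bio_vec *bv)
{
	/* Assumed future layout: { phys_addr_t bv_phys; unsigned int bv_len; }.
	 * Map the page containing the fragment and return a pointer to the
	 * start of the fragment within it.  A fragment crossing a page
	 * boundary would need more care on HIGHMEM configs.
	 */
	return kmap_local_pfn(PHYS_PFN(bv->bv_phys)) +
		offset_in_page(bv->bv_phys);
}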
So what I'd like to do is:
(1) Separate fragment lifetime management from sk_buff. No more wrangling of
refcounts in the skbuff code. If you clone an skb, you stick an extra
ref on the lifetime management struct, not the page.
(2) Create a chainable 'network buffer' struct, e.g.:
enum net_txbuf_type {
	NET_TXBUF_BUFFERED,		/* Buffered copy of data */
	NET_TXBUF_ZCOPY_USER,		/* Zerocopy of user buffers */
	NET_TXBUF_ZCOPY_KERNEL,		/* Zerocopy of kernel buffers */
};

struct net_txbuf {
	struct net_txbuf *next;		/* Next txbuf in the chain */
	struct mmpin	mm_pin;
	unsigned int	start_pos;
	unsigned int	end_pos;
	unsigned int	extracted_to;
	refcount_t	ref;		/* Pins from skbuffs and preceding txbuf */
	enum net_txbuf_type type;
	u8		nr_used;
	bool		wmem_charged;
	bool		got_copied;
	union {
		/* For NET_TXBUF_BUFFERED: */
		struct {
			void	*bufs[16];
			u8	bufs_orders[16];
			bool	last_buf_freeable;
		};
		/* For NET_TXBUF_ZCOPY_*: */
		struct {
			struct sock	*sk;
			struct sk_buff	*notify;
			msg_completion_t completion;
			void		*completion_data;
			struct bio_vec	frags[12];
		};
	};
};
(Note this is still very much a WiP and subject to change.)
So how I envision it working depends on the type of flow in the socket.
For the transmission side of streaming sockets (e.g. TCP), the socket
maintains a single chain of these. Each txbuf is of a single type, but
multiple types can be interleaved.
For non-ZC flow, as data is imported, it's copied into pages attached to
the current head txbuf of type BUFFERED, with more pages being attached
as we progress. Successive writes just keep filling the remaining space
in the most recently added page. Each skbuff generated pins the txbuf it
starts in, and each txbuf pins its successor.
As skbuffs are consumed, they unpin the root txbuf. However, this could
leave an awful lot of memory pinned for a long time, so I would mitigate
this in two ways: firstly, where possible, keep track of the transmitted
byte position and progressively destruct the txbuf; secondly, if we
completely use up a partially filled txbuf then reset the queue.
An skbuff's frag list then has a bio_vec[] that refers to fragments of
the buffers recorded in the txbuf chain. An skbuff may span multiple
txbufs and a txbuf may provision multiple skbuffs.
For the transmission side of datagram sockets (e.g. UDP) where the
messages may complete out of order, I think I would give each datagram
its own series of txbufs, but link the tails together to manage the
SO_EE_ORIGIN_ZEROCOPY notification generation if dealing with userspace.
If dealing with the kernel, there's no need to link them together as the
kernel can provide a destructor for each datagram.
(3) When doing zerocopy from userspace, do calls to GUP to get batches of
non-contiguous pages into a bio_vec array.
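
By way of illustration only, something along these lines - the function
name and the batching policy are made up, but pin_user_pages_fast(),
unpin_user_pages() and bvec_set_page() are the existing APIs:

static ssize_t txbuf_pin_user_batch(unsigned long uaddr, size_t len,
				    struct bio_vec *bv,
				    unsigned int max_pages)
{
	unsigned int offset = offset_in_page(uaddr);
	unsigned int nr = min_t(unsigned int, max_pages,
				DIV_ROUND_UP(offset + len, PAGE_SIZE));
	struct page **pages;
	ssize_t copied = 0;
	int got, i;

	pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* FOLL_PIN is implied; the data is only read for transmission, so no
	 * FOLL_WRITE.  The pins would be dropped with unpin_user_pages()
	 * when the txbuf is destroyed.
	 */
	got = pin_user_pages_fast(uaddr & PAGE_MASK, nr, 0, pages);
	if (got <= 0) {
		kfree(pages);
		return got;
	}

	/* Record each pinned page as a single-page bio_vec. */
	for (i = 0; i < got; i++) {
		size_t part = min_t(size_t, len, PAGE_SIZE - offset);

		bvec_set_page(&bv[i], pages[i], part, offset);
		copied += part;
		len -= part;
		offset = 0;
	}

	kfree(pages);
	return copied;
}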
(4) Because AF_UNIX and the loopback driver transfer packets from the
transmission queue of one socket down into the reception queue of
another, the use of txbufs would also need to extend onto the receive
side (and so "txbufs" would be a misnomer).
When receiving a packet, a txbuf would need to be allocated and the
received buffers attached to it. The pages wouldn't necessarily need
refcounts as the txbuf holds them. The skbuff holds a ref on the txbuf.
(5) Cloning an skbuff would involve just taking an extra ref on the first
txbuf. Splitting off part of an skbuff would involve fast-forwarding the
txbuf chain for the second part and pinning that.
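
Roughly, cloning would then reduce to something like the following (helper
names are hypothetical; the point is that only the refcount on the first
txbuf is touched, not per-page refs):

void net_txbuf_free(struct net_txbuf *txb);	/* hypothetical destructor */

static inline struct net_txbuf *net_txbuf_get(struct net_txbuf *txb)
{
	refcount_inc(&txb->ref);
	return txb;
}

static inline void net_txbuf_put(struct net_txbuf *txb)
{
	if (refcount_dec_and_test(&txb->ref))
		net_txbuf_free(txb);
}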
(6) I have a chained-bio_vec array concept, with an iov_iter type to go with
it, that might make it easier to string together the fragments in a
reassembled packet and represent them as an iov_iter, thereby allowing us
to use common iterator routines for things like ICMP and packet crypto.
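
Something along the lines of the following node (name and layout purely
illustrative):

struct bvec_chain {
	struct bvec_chain	*next;	/* Next array in the chain */
	unsigned int		nr;	/* Number of bio_vecs in bv[] */
	struct bio_vec		bv[];
};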
(7) We need to separate net_iov from struct page, and it might make things
easier if we do that now, allocating net_iov from a slab.
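
A minimal sketch of that, assuming net_iov no longer overlays struct page
(cache name and init hook are illustrative):

static struct kmem_cache *net_iov_cachep __ro_after_init;

static int __init net_iov_cache_init(void)
{
	net_iov_cachep = kmem_cache_create("net_iov", sizeof(struct net_iov),
					   0, SLAB_HWCACHE_ALIGN, NULL);
	return net_iov_cachep ? 0 : -ENOMEM;
}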
(8) Reference the txbuf in a splice and provide a destructor that drops that
reference. For small splices, I'd be very tempted to simply copy the
data. For splice-out of data that was spliced into an AF_UNIX socket or
zerocopy data that passed through a loopback device, I'm also very
tempted to make splice copy at that point. There's a potential DoS
attack whereby someone can endlessly splice tiny bits of a message or
just sit on them, preventing the original provider from recovering its
memory.
(9) Make it easy for a network filesystem to create an entire compound
message and present it to the socket in a single sendmsg() with a
destructor.
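
For reference, a network filesystem can already hand over a preassembled
bio_vec array in one sendmsg() with MSG_SPLICE_PAGES, roughly like the
sketch below (function and parameter names are placeholders); the destructor
is the missing piece that would let kmalloc()'d buffers be included too:

static int netfs_send_whole_message(struct socket *sock, struct bio_vec *bvec,
				    unsigned int nr_bvec, size_t len)
{
	struct msghdr msg = {
		.msg_flags = MSG_SPLICE_PAGES,
	};

	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, nr_bvec, len);
	return sock_sendmsg(sock, &msg);
}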
I've pushed my current changes (very incomplete as they are) to:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-experimental
I'm writing functions to abstract out the loading of data into the txbuf
chain and attaching it to skbuffs. These can be found in skbuff.c as
net_txbuf_*(). I've modified TCP sendmsg() to use them.
David