Message-ID: <770012.1748618092@warthog.procyon.org.uk>
Date: Fri, 30 May 2025 16:14:52 +0100
From: David Howells <dhowells@...hat.com>
To: Mina Almasry <almasrymina@...gle.com>
cc: dhowells@...hat.com, willy@...radead.org, hch@...radead.org,
Jakub Kicinski <kuba@...nel.org>, Eric Dumazet <edumazet@...gle.com>,
netdev@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Device mem changes vs pinning/zerocopy changes
Hi Mina,
I've seen that your transmission-side TCP devicemem stuff has just gone in,
and it conflicts somewhat with what I'm trying to do. I think you're working
on the problem bottom-up and I'm working on it top-down, so would you be
willing to collaborate on it?
So, to summarise what we need to change (you may already know all of this):
(*) The refcount in struct page is going to go away. The sk_buff fragment
wrangling code, however, occasionally decides to override the zerocopy
mode and grab refs on the pages pointed to by those fragments. sk_buffs
*really* want those page refs - and it does simplify memory handling.
But.
Anyway, we need to stop taking refs where possible. A fragment may in
future point to a sequence of pages and we would only be getting a ref on
one of them.
(*) Further, the page struct is intended to be slimmed down to a single typed
pointer if possible, so all the metadata in the net_iov struct will have
to be separately allocated.
(*) Currently, when performing MSG_ZEROCOPY, we just take refs on the user
pages specified by the iterator, but we need to stop doing that. We need
to call GUP to take a "pin" instead (and must not take any refs). The
pages we get access to may be folio-type, anon-type, or some sort of
device type.
(*) It would be good to do a batch lookup of user buffers to cut down on the
number of page table trawls we do - but, on the other hand, that might
generate more page faults upfront.
(*) Splice and vmsplice. If only I could uninvent them... Anyway, they give
us buffers from a pipe - but those buffers come with destructors, so we
should not take refs on the pages they appear to carry; we should rely on
the destructor instead.
(*) The intention is to change struct bio_vec to be just physical address and
length, with no page pointer. You'd then use, say, kmap_local_phys() or
kmap_local_bvec() to access the contents from the CPU (a rough sketch of
what such an accessor might look like follows this list). We could then
revert the fragment pointers to being bio_vecs.
(*) Kernel services, such as network filesystems, can't pass kmalloc()'d data
to sendmsg(MSG_SPLICE_PAGES) because slabs don't have refcounts and, in
any case, the object lifetime is not managed by refcount. However, if we
had a destructor, this restriction could go away.
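
To illustrate the bio_vec point above, here's a minimal sketch of what a
kmap_local_bvec() accessor might look like, assuming the new bio_vec carries
its physical address in a hypothetical bv_phys field - none of this exists
upstream yet, only kmap_local_pfn() does:

static inline void *kmap_local_bvec(const struct bio_vec *bv)
{
	/* Assumed future layout: { phys_addr_t bv_phys; unsigned int bv_len; }.
	 * Map the page containing the fragment and return a pointer to the
	 * start of the fragment within it.  A fragment crossing a page
	 * boundary would need more care on HIGHMEM configs.
	 */
	return kmap_local_pfn(PHYS_PFN(bv->bv_phys)) +
		offset_in_page(bv->bv_phys);
}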
So what I'd like to do is:
(1) Separate fragment lifetime management from sk_buff. No more wrangling of
refcounts in the skbuff code. If you clone an skb, you stick an extra
ref on the lifetime management struct, not the page.
(2) Create a chainable 'network buffer' struct, e.g.:
enum net_txbuf_type {
	NET_TXBUF_BUFFERED,		/* Buffered copy of data */
	NET_TXBUF_ZCOPY_USER,		/* Zerocopy of user buffers */
	NET_TXBUF_ZCOPY_KERNEL,		/* Zerocopy of kernel buffers */
};

struct net_txbuf {
	struct net_txbuf *next;		/* Next txbuf in the chain */
	struct mmpin	mm_pin;
	unsigned int	start_pos;
	unsigned int	end_pos;
	unsigned int	extracted_to;
	refcount_t	ref;		/* Pins from skbuffs and preceding txbuf */
	enum net_txbuf_type type;
	u8		nr_used;
	bool		wmem_charged;
	bool		got_copied;
	union {
		/* For NET_TXBUF_BUFFERED: */
		struct {
			void	*bufs[16];
			u8	bufs_orders[16];
			bool	last_buf_freeable;
		};
		/* For NET_TXBUF_ZCOPY_*: */
		struct {
			struct sock	*sk;
			struct sk_buff	*notify;
			msg_completion_t completion;
			void		*completion_data;
			struct bio_vec	frags[12];
		};
	};
};
(Note this is still very much a WiP and subject to change.)
So how I envision it working depends on the type of flow in the socket.
For the transmission side of streaming sockets (e.g. TCP), the socket
maintains a single chain of these. Each txbuf is of a single type, but
multiple types can be interleaved.
For non-ZC flow, as data is imported, it's copied into pages attached to
the current head txbuf of type BUFFERED, with more pages being attached
as we progress. Successive writes just keep filling the remaining space
in the most recently added page. Each skbuff generated pins the txbuf it
starts in, and each txbuf pins its successor.
As skbuffs are consumed, they unpin the root txbuf. However, this could
leave an awful lot of memory pinned for a long time, so I would mitigate
this in two ways: firstly, where possible, keep track of the transmitted
byte position and progressively destruct the txbuf; secondly, if we
completely use up a partially filled txbuf then reset the queue.
An skbuff's frag list then has a bio_vec[] that refers to fragments of
the buffers recorded in the txbuf chain. An skbuff may span multiple
txbufs and a txbuf may provision multiple skbuffs.
For the transmission side of datagram sockets (e.g. UDP) where the
messages may complete out of order, I think I would give each datagram
its own series of txbufs, but link the tails together to manage the
SO_EE_ORIGIN_ZEROCOPY notification generation if dealing with userspace.
If dealing with the kernel, there's no need to link them together as the
kernel can provide a destructor for each datagram.
(3) When doing zerocopy from userspace, do calls to GUP to get batches of
non-contiguous pages into a bio_vec array.
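
By way of illustration only, something along these lines - the function
name and the batching policy are made up, but pin_user_pages_fast(),
unpin_user_pages() and bvec_set_page() are the existing APIs:

static ssize_t txbuf_pin_user_batch(unsigned long uaddr, size_t len,
				    struct bio_vec *bv,
				    unsigned int max_pages)
{
	unsigned int offset = offset_in_page(uaddr);
	unsigned int nr = min_t(unsigned int, max_pages,
				DIV_ROUND_UP(offset + len, PAGE_SIZE));
	struct page **pages;
	ssize_t copied = 0;
	int got, i;

	pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* FOLL_PIN is implied; the data is only read for transmission, so no
	 * FOLL_WRITE.  The pins would be dropped with unpin_user_pages()
	 * when the txbuf is destroyed.
	 */
	got = pin_user_pages_fast(uaddr & PAGE_MASK, nr, 0, pages);
	if (got <= 0) {
		kfree(pages);
		return got;
	}

	/* Record each pinned page as a single-page bio_vec. */
	for (i = 0; i < got; i++) {
		size_t part = min_t(size_t, len, PAGE_SIZE - offset);

		bvec_set_page(&bv[i], pages[i], part, offset);
		copied += part;
		len -= part;
		offset = 0;
	}

	kfree(pages);
	return copied;
}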
(4) Because AF_UNIX and the loopback driver transfer packets from the
transmission queue of one socket down into the reception queue of
another, the use of txbufs would also need to extend onto the receive
side (and so "txbufs" would be a misnomer).
When receiving a packet, a txbuf would need to be allocated and the
received buffers attached to it. The pages wouldn't necessarily need
refcounts as the txbuf holds them. The skbuff holds a ref on the txbuf.
(5) Cloning an skbuff would involve just taking an extra ref on the first
txbuf. Splitting off part of an skbuff would involve fast-forwarding the
txbuf chain for the second part and pinning that.
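
Roughly, cloning would then reduce to something like the following (helper
names are hypothetical; the point is that only the refcount on the first
txbuf is touched, not per-page refs):

void net_txbuf_free(struct net_txbuf *txb);	/* hypothetical destructor */

static inline struct net_txbuf *net_txbuf_get(struct net_txbuf *txb)
{
	refcount_inc(&txb->ref);
	return txb;
}

static inline void net_txbuf_put(struct net_txbuf *txb)
{
	if (refcount_dec_and_test(&txb->ref))
		net_txbuf_free(txb);
}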
(6) I have a chained-bio_vec array concept, with an iov_iter type to go with
it, that might make it easier to string together the fragments in a
reassembled packet and represent them as an iov_iter, thereby allowing us
to use common iterator routines for things like ICMP and packet crypto.
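
Something along the lines of the following node (name and layout purely
illustrative):

struct bvec_chain {
	struct bvec_chain	*next;	/* Next array in the chain */
	unsigned int		nr;	/* Number of bio_vecs in bv[] */
	struct bio_vec		bv[];
};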
(7) We need to separate net_iov from struct page, and it might make things
easier if we do that now, allocating net_iov from a slab.
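
A minimal sketch of that, assuming net_iov no longer overlays struct page
(cache name and init hook are illustrative):

static struct kmem_cache *net_iov_cachep __ro_after_init;

static int __init net_iov_cache_init(void)
{
	net_iov_cachep = kmem_cache_create("net_iov", sizeof(struct net_iov),
					   0, SLAB_HWCACHE_ALIGN, NULL);
	return net_iov_cachep ? 0 : -ENOMEM;
}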
(8) Reference the txbuf in a splice and provide a destructor that drops that
reference. For small splices, I'd be very tempted to simply copy the
data. For splice-out of data that was spliced into an AF_UNIX socket or
zerocopy data that passed through a loopback device, I'm also very
tempted to make splice copy at that point. There's a potential DoS
attack whereby someone can endlessly splice tiny bits of a message or
just sit on them, preventing the original provider from recovering its
memory.
(9) Make it easy for a network filesystem to create an entire compound
message and present it to the socket in a single sendmsg() with a
destructor.
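
For reference, a network filesystem can already hand over a preassembled
bio_vec array in one sendmsg() with MSG_SPLICE_PAGES, roughly like the
sketch below (function and parameter names are placeholders); the destructor
is the missing piece that would let kmalloc()'d buffers be included too:

static int netfs_send_whole_message(struct socket *sock, struct bio_vec *bvec,
				    unsigned int nr_bvec, size_t len)
{
	struct msghdr msg = {
		.msg_flags = MSG_SPLICE_PAGES,
	};

	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, nr_bvec, len);
	return sock_sendmsg(sock, &msg);
}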
I've pushed my current changes (very incomplete as they are) to:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-experimental
I'm writing functions to abstract out the loading of data into the txbuf
chain and attaching it to skbuffs. These can be found in skbuff.c as
net_txbuf_*(). I've modified TCP sendmsg() to use them.
David