Message-ID: <2135907.1747061490@warthog.procyon.org.uk>
Date: Mon, 12 May 2025 15:51:30 +0100
From: David Howells <dhowells@...hat.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: dhowells@...hat.com, Eric Dumazet <edumazet@...gle.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
David Hildenbrand <david@...hat.com>,
John Hubbard <jhubbard@...dia.com>,
Christoph Hellwig <hch@...radead.org>, willy@...radead.org,
Christian Brauner <brauner@...nel.org>,
Al Viro <viro@...iv.linux.org.uk>,
Miklos Szeredi <mszeredi@...hat.com>, torvalds@...ux-foundation.org,
netdev@...r.kernel.org, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN

I'm looking at how to make sendmsg() handle page pinning - and also working
towards supporting the eventual removal of the page refcount, with it only
being available for certain memory types.

One of the outstanding issues is in sendmsg(). As with DIO writes, sendmsg()
should be pinning memory (FOLL_PIN/GUP) rather than simply getting refs on it
before it attaches it to an sk_buff. Without this, if memory is spliced into
an AF_UNIX socket and the process then forks, that memory gets attached to the
child process, and the child can alter the data, probably by accident, if the
memory is on the stack or in the heap.

Further, kernel services can use MSG_SPLICE_PAGES to attach memory directly to
an AF_UNIX socket (though I'm not sure if anyone actually does this).

(For writing to TCP/UDP with MSG_ZEROCOPY, MSG_SPLICE_PAGES or vmsplice, I
think we're probably fine - assuming the loopback driver doesn't give the
receiver the transmitter's buffers to use directly... This may be a big
'if'.)

Now, this probably wouldn't be a problem, but for the fact that one can also
splice this stuff back *out* of the socket.

The same issues exist for pipes too.
The question is what should happen here to a memory span for which the network
layer or pipe driver is not allowed to take a reference, but rather must call
a destructor? Particularly if, say, it's just a small part of a larger span.

It seems reasonable that we should allow pinned memory spans to be queued in a
socket or a pipe - that way, we only have to copy the data once in the event
that the data is extracted with read(), recvmsg() or similar. But if it's
spliced out we then have all the fun of managing the lifetime - especially if
it's a big transfer that gets split into bits. In such a case, I wonder if we
can just duplicate the memory at splice-out rather than trying to keep all the
tracking intact.

If the memory was copied in, then moving the pages should be fine - though the
memory may not be of a ref'able type (which would be fun if bits of such a
page get spliced to different places).

I'm sure there is some app somewhere (fuse maybe?) where this would be a
performance problem, though.

And then there's vmsplice(). The same goes for vmsplice() to AF_UNIX or to a
pipe. That should also pin memory. It may also be possible to vmsplice a
pinned page into the target process's VM or a page from a memory span with
some other type of destruction. I don't suppose we can deprecate vmsplice()?
David