netdev - Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <bb31c07a-0b70-4bca-9c59-42f6233791cd@redhat.com>
Date: Mon, 12 May 2025 23:59:24 +0200
From: David Hildenbrand <david@...hat.com>
To: David Howells <dhowells@...hat.com>, Andrew Lunn <andrew@...n.ch>
Cc: Eric Dumazet <edumazet@...gle.com>, "David S. Miller"
 <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
 John Hubbard <jhubbard@...dia.com>, Christoph Hellwig <hch@...radead.org>,
 willy@...radead.org, Christian Brauner <brauner@...nel.org>,
 Al Viro <viro@...iv.linux.org.uk>, Miklos Szeredi <mszeredi@...hat.com>,
 torvalds@...ux-foundation.org, netdev@...r.kernel.org, linux-mm@...ck.org,
 linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN

On 12.05.25 16:51, David Howells wrote:
> I'm looking at how to make sendmsg() handle page pinning - and also working
> towards supporting the page refcount eventually being removed and only being
> available with certain memory types.
> 
> One of the outstanding issues is in sendmsg().  Analogously with DIO writes,
> sendmsg() should be pinning memory (FOLL_PIN/GUP) rather than simply getting
> refs on it before it attaches it to an sk_buff.  Without this, if memory is
> spliced into an AF_UNIX socket and then the process forks, that memory gets
> attached to the child process, and the child can alter the data

That should not be possible. Neither the child nor the parent can modify 
the page. Any write attempt will result in Copy-on-Write.

The issue is that if the parent writes to some unrelated part of the 
page after fork() but before DIO completed, the parent will trigger 
Copy-on-Write and the DIO will essentially be lost from the parent's POV 
(goes to the wrong page).

> probably by
> accident, if the memory is on the stack or in the heap.
> 
> Further, kernel services can use MSG_SPLICE_PAGES to attach memory directly to
> an AF_UNIX pipe (though I'm not sure if anyone actually does this).
> 
> (For writing to TCP/UDP with MSG_ZEROCOPY, MSG_SPLICE_PAGES or vmsplice, I
> think we're probably fine - assuming the loopback driver doesn't give the
> receiver the transmitter's buffers to use directly...  This may be a big
> 'if'.)
> 
> Now, this probably wouldn't be a problem, but for the fact that one can also
> splice this stuff back *out* of the socket.
> 
> The same issues exist for pipes too.
> 
> The question is what should happen here to a memory span for which the network
> layer or pipe driver is not allowed to take reference, but rather must call a
> destructor?  Particularly if, say, it's just a small part of a larger span.
> 
> It seems reasonable that we should allow pinned memory spans to be queued in a
> socket or a pipe - that way, we only have to copy the data once in the event
> that the data is extracted with read(), recvmsg() or similar.  But if it's
> spliced out we then have all the fun of managing the lifetime - especially if
> it's a big transfer that gets split into bits.  In such a case, I wonder if we
> can just duplicate the memory at splice-out rather than trying to keep all the
> tracking intact.
> 
> If the memory was copied in, then moving the pages should be fine - though the
> memory may not be of a ref'able type (which would be fun if bits of such a
> page get spliced to different places).
> 
> I'm sure there is some app somewhere (fuse maybe?) where this would be a
> performance problem, though.
> 
> And then there's vmsplice().  The same goes for vmsplice() to AF_UNIX or to a
> pipe.  That should also pin memory.  It may also be possible to vmsplice a
> pinned page into the target process's VM or a page from a memory span with
> some other type of destruction.

IIRC, vmsplice() never does that optimization for that direction (map 
pinned page into the target process). It would be a mess.

But yes, vmsplice() should be using FOLL_PIN|FOLL_LONGTERM. Deprecation 
is unlikely to happen, I'm afraid :(

-- 
Cheers,

David / dhildenb