Message-ID: <1111403.1750688218@warthog.procyon.org.uk>
Date: Mon, 23 Jun 2025 15:16:58 +0100
From: David Howells <dhowells@...hat.com>
To: Christoph Hellwig <hch@...radead.org>
Cc: dhowells@...hat.com, Andrew Lunn <andrew@...n.ch>,
Eric Dumazet <edumazet@...gle.com>,
"David
S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
David Hildenbrand <david@...hat.com>,
John Hubbard <jhubbard@...dia.com>, willy@...radead.org,
Christian Brauner <brauner@...nel.org>,
Al Viro <viro@...iv.linux.org.uk>,
Miklos Szeredi <mszeredi@...hat.com>, torvalds@...ux-foundation.org,
netdev@...r.kernel.org, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN

Christoph Hellwig <hch@...radead.org> wrote:

> > The question is what should happen here to a memory span for which the
> > network layer or pipe driver is not allowed to take reference, but rather
> > must call a destructor? Particularly if, say, it's just a small part of a
> > larger span.
>
> What is a "span" in this context?

In the first case, I was thinking along the lines of a bio_vec that says
{physaddr,len}, defining a "span" of memory - basically just a contiguous
range of physical addresses, if you prefer.
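
To make that a bit more concrete, something along these lines - purely
illustrative, this isn't an existing kernel type and the names are made up:

/* Purely illustrative: a "span" is just a contiguous range of physical
 * memory, plus the cleanup that must be used instead of plain
 * put_page()/folio refcounting.
 */
#include <linux/types.h>

struct mem_span {
	phys_addr_t	physaddr;	/* start of the contiguous range */
	size_t		len;		/* length of the range in bytes */
	void		(*destructor)(void *priv); /* call this, don't drop refs */
	void		*priv;		/* owner's cookie for the destructor */
};
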
However, someone can, for example, vmsplice a span of memory into a pipe - say
they add a whole page, all nicely aligned, but then they splice it out a byte
at a time into 4096 other pipes. Each of those other pipes now has a small
part of a larger span and needs to share the cleanup information.
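
Userspace can do that today with nothing special - roughly this, with error
checking omitted, and with the single splice() repeated for 4096 destination
pipes in the scenario above:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int src[2], dst[2];
	void *page;
	struct iovec iov;

	pipe(src);
	pipe(dst);

	/* vmsplice one nicely aligned page into the source pipe. */
	posix_memalign(&page, 4096, 4096);
	memset(page, 'x', 4096);
	iov.iov_base = page;
	iov.iov_len  = 4096;
	vmsplice(src[1], &iov, 1, 0);

	/* Carve it up: move a single byte on into another pipe.  Each
	 * destination pipe ends up holding a 1-byte slice of the same
	 * underlying page.
	 */
	splice(src[0], NULL, dst[1], NULL, 1, 0);
	return 0;
}
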
Now, imagine that a network filesystem writes a message into a TCP socket,
where that message corresponds to an RPC call request and includes a number
of kernel buffers that the network layer isn't permitted to take references
on; instead, a destructor must be called.  The request message may transit
through the loopback driver and get placed on the Rx queue of another TCP
socket, whence it may be spliced off into a pipe.

Alternatively, if virtual I/O is involved, this message may get passed down to
a layer outside of the system (though I don't think this is, in principle, any
different from DMA being done by a NIC).

And then there's relayfs and fuse, which seem to do weird stuff.

For the splicing of a loop-backed kernel message out of a TCP socket, it
might make sense just to copy the message at that point.  The problem is
that the kernel doesn't know what's going to happen to it next.

> In general splice unlike direct I/O relies on page reference counts inside
> the splice machinery. But that is configurable through the
> pipe_buf_operations. So if you want something to be handled by splice that
> does not use simple page refcounts you need special pipe_buf_operations for
> it. And you'd better have a really good use case for this to be worthwhile.

Yes.  vmsplice is the equivalent of direct I/O and should really do the same
pinning thing that, say, write() to an O_DIRECT file does.
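
That would mean grabbing the user pages with pin_user_pages() (FOLL_PIN), as
the O_DIRECT path does these days through the iov_iter machinery, and giving
the resulting pipe buffers ops that drop the pin rather than putting a plain
page ref.  Very roughly - an untested sketch, where the ops structure and
unpin_user_page() are real API but the "pinned_" names are invented:

#include <linux/mm.h>
#include <linux/pipe_fs_i.h>

/* The page in the buffer was pinned with pin_user_pages(), so the pipe
 * must drop the pin in ->release() rather than put a plain page ref, and
 * must not duplicate or steal the buffer by refcount games.
 */
static void pinned_pipe_buf_release(struct pipe_inode_info *pipe,
				    struct pipe_buffer *buf)
{
	unpin_user_page(buf->page);
}

static bool pinned_pipe_buf_get(struct pipe_inode_info *pipe,
				struct pipe_buffer *buf)
{
	return false;	/* a copy of the buffer would need its own pin */
}

static bool pinned_pipe_buf_try_steal(struct pipe_inode_info *pipe,
				      struct pipe_buffer *buf)
{
	return false;	/* the page still belongs to the pinning mm */
}

static const struct pipe_buf_operations pinned_pipe_buf_ops = {
	.release	= pinned_pipe_buf_release,
	.try_steal	= pinned_pipe_buf_try_steal,
	.get		= pinned_pipe_buf_get,
};

Having ->get() refuse should make tee and pipe-to-pipe splice bail on such a
buffer rather than silently taking a bare ref on a pinned page.
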
David