Message-ID: <1111403.1750688218@warthog.procyon.org.uk>
Date: Mon, 23 Jun 2025 15:16:58 +0100
From: David Howells <dhowells@...hat.com>
To: Christoph Hellwig <hch@...radead.org>
Cc: dhowells@...hat.com, Andrew Lunn <andrew@...n.ch>,
Eric Dumazet <edumazet@...gle.com>,
"David
S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
David Hildenbrand <david@...hat.com>,
John Hubbard <jhubbard@...dia.com>, willy@...radead.org,
Christian Brauner <brauner@...nel.org>,
Al Viro <viro@...iv.linux.org.uk>,
Miklos Szeredi <mszeredi@...hat.com>, torvalds@...ux-foundation.org,
netdev@...r.kernel.org, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN

Christoph Hellwig <hch@...radead.org> wrote:

> > The question is what should happen here to a memory span for which the
> > network layer or pipe driver is not allowed to take reference, but rather
> > must call a destructor? Particularly if, say, it's just a small part of a
> > larger span.
>
> What is a "span" in this context?

In the first case, I was thinking along the lines of a bio_vec that says
{physaddr,len}, defining a "span" of memory - basically just a contiguous
range of physical addresses, if you prefer.
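
To make that a bit more concrete, something along these lines - purely
illustrative, this isn't an existing kernel type and the names are made up:

/* Purely illustrative: a "span" is just a contiguous range of physical
 * memory, plus the cleanup that must be used instead of plain
 * put_page()/folio refcounting.
 */
#include <linux/types.h>

struct mem_span {
	phys_addr_t	physaddr;	/* start of the contiguous range */
	size_t		len;		/* length of the range in bytes */
	void		(*destructor)(void *priv); /* call this, don't drop refs */
	void		*priv;		/* owner's cookie for the destructor */
};
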
However, someone can, for example, vmsplice a span of memory into a pipe - say
they add a whole page, all nicely aligned, but then they splice it out a byte
at a time into 4096 other pipes. Each of those other pipes now has a small
part of a larger span and needs to share the cleanup information.
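
Userspace can do that today with nothing special - roughly this, with error
checking omitted, and with the single splice() repeated for 4096 destination
pipes in the scenario above:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int src[2], dst[2];
	void *page;
	struct iovec iov;

	pipe(src);
	pipe(dst);

	/* vmsplice one nicely aligned page into the source pipe. */
	posix_memalign(&page, 4096, 4096);
	memset(page, 'x', 4096);
	iov.iov_base = page;
	iov.iov_len  = 4096;
	vmsplice(src[1], &iov, 1, 0);

	/* Carve it up: move a single byte on into another pipe.  Each
	 * destination pipe ends up holding a 1-byte slice of the same
	 * underlying page.
	 */
	splice(src[0], NULL, dst[1], NULL, 1, 0);
	return 0;
}
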
Now, imagine that a network filesystem writes a message into a TCP socket,
where that message corresponds to an RPC call request and includes a number
of kernel buffers that the network layer isn't permitted to take references
on; instead, a destructor must be called.  The request message may transit
through the loopback driver and get placed on the Rx queue of another TCP
socket, whence it may be spliced off into a pipe.

Alternatively, if virtual I/O is involved, this message may get passed down to
a layer outside of the system (though I don't think this is, in principle, any
different from DMA being done by a NIC).

And then there's relayfs and fuse, which seem to do weird stuff.

For the splicing of a loop-backed kernel message out of a TCP socket, it
might make sense just to copy the message at that point.  The problem is
that the kernel doesn't know what's going to happen to it next.

> In general splice unlike direct I/O relies on page reference counts inside
> the splice machinery. But that is configurable through the
> pipe_buf_operations. So if you want something to be handled by splice that
> does not use simple page refcounts you need special pipe_buf_operations for
> it. And you'd better have a really good use case for this to be worthwhile.

Yes.  vmsplice is the equivalent of direct I/O and should really do the same
pinning thing that, say, write() to an O_DIRECT file does.
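
That would mean grabbing the user pages with pin_user_pages() (FOLL_PIN), as
the O_DIRECT path does these days through the iov_iter machinery, and giving
the resulting pipe buffers ops that drop the pin rather than putting a plain
page ref.  Very roughly - an untested sketch, where the ops structure and
unpin_user_page() are real API but the "pinned_" names are invented:

#include <linux/mm.h>
#include <linux/pipe_fs_i.h>

/* The page in the buffer was pinned with pin_user_pages(), so the pipe
 * must drop the pin in ->release() rather than put a plain page ref, and
 * must not duplicate or steal the buffer by refcount games.
 */
static void pinned_pipe_buf_release(struct pipe_inode_info *pipe,
				    struct pipe_buffer *buf)
{
	unpin_user_page(buf->page);
}

static bool pinned_pipe_buf_get(struct pipe_inode_info *pipe,
				struct pipe_buffer *buf)
{
	return false;	/* a copy of the buffer would need its own pin */
}

static bool pinned_pipe_buf_try_steal(struct pipe_inode_info *pipe,
				      struct pipe_buffer *buf)
{
	return false;	/* the page still belongs to the pinning mm */
}

static const struct pipe_buf_operations pinned_pipe_buf_ops = {
	.release	= pinned_pipe_buf_release,
	.try_steal	= pinned_pipe_buf_try_steal,
	.get		= pinned_pipe_buf_get,
};

Having ->get() refuse should make tee and pipe-to-pipe splice bail on such a
buffer rather than silently taking a bare ref on a pinned page.
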
David