netdev - Re: Bug in short splice to socket?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <832277.1685630048@warthog.procyon.org.uk>
Date: Thu, 01 Jun 2023 15:34:08 +0100
From: David Howells <dhowells@...hat.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: dhowells@...hat.com, Jakub Kicinski <kuba@...nel.org>,
    netdev@...r.kernel.org, "David S.
 Miller" <davem@...emloft.net>,
    Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
    Willem de Bruijn <willemdebruijn.kernel@...il.com>,
    David Ahern <dsahern@...nel.org>,
    Matthew Wilcox <willy@...radead.org>, Jens Axboe <axboe@...nel.dk>,
    linux-mm@...ck.org, linux-kernel@...r.kernel.org,
    Chuck Lever <chuck.lever@...cle.com>,
    Boris Pismenny <borisp@...dia.com>,
    John Fastabend <john.fastabend@...il.com>,
    Christoph Hellwig <hch@...radead.org>
Subject: Re: Bug in short splice to socket?


Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Thu, Jun 1, 2023 at 9:09 AM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > The reason the old code is garbage is that it sets SPLICE_F_MORE
> > entirely in the wrong place. It sets it *after* it has done the
> > splice(). That's just crazy.
> 
> Clarification, because there are two splice's (from and to): by "after
> the splice" I mean after the splice source, of course. It's still set
> before the splice _to_ the network.
> 
> (But it is still true that I hope the networking code itself then sets
> MSG_MORE even if SPLICE_F_MORE wasn't set, if it gets a buffer that is
> bigger than what it can handle right now - so there are two
> *different* reasons for "more data" - either the source knows it has
> more to give, or the destination knows it didn't use everything it
> got).

I'm in the process of changing things so that ->sendpage() is removed and
replaced with sendmsg() with MSG_SPLICE_PAGES.  The aim is to end up with a
splice_to_socket() function (see attached) that transcribes a chunk of the
pipe into a BVEC iterator and does a single sendmsg() that pushes a whole
chunk of data to the socket in one go.

In the network protocol, the loop inside sendmsg splices those pages into
socket buffers, sleeping as necessary to gain sufficient memory to transcribe
all of them.  It can return partly done and the fully consumed pages will be
consumed and then it'll return to gain more data.

At the moment, it transcribes 16 pages at a time.  I could make it set
MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b)
there's yet more data in the buffer.

However, this might well cause a malfunction in UDP, for example.  MSG_MORE
corks the current packet, so if I ask sendfile() say shove 32K into a packet,
if, say, 16K is read from the source and entirely transcribed into the packet,
if I understand what you're proposing, MSG_MORE wouldn't get set and the
packet would be transmitted early.  A similar issue might exist for AF_TLS,
where the lack of MSG_MORE triggers an end-of-record.

The problem isn't that the buffer supplied is bigger than the protocol could
handle in one go, it's that what we read was less than the amount that the
user requested - but we don't know if there will be more data.

One solution might be to pass MSG_MORE regardless, and then follow up with a
null sendmsg() if we then hit a zero-length read (ie. EOF) or -ENODATA (ie. a
hole).

> The point is that the splice *source* knows whether there is more data
> to be had, and that's where the "there is more" should be set.

Actually, that's not necessarily true.  If the source is a pipe or a socket,
there may not be more, but we don't know that until the far end is closed -
but for a file it probably is (we could be racing with a writer).  And then
there's seqfile-based things like the trace buffer or procfiles which are
really a class of their own.

> Basically my argument is that the whole "there is more data" should be
> set by "->splice_read()" not by some hack in some generic
> splice_direct_to_actor() function that is absolutely not supposed to
> know about the internals of the source or the destination.
> 
> Do we have a good interface for that? No. I get the feeling that to
> actually fix this, we'd have to pass in the 'struct splice_desc"
> pointer to ->splice_read().

That might become more feasible, actually.  In Jens's tree are the patches to
clean up splice_read a lot and get rid of ITER_PIPE.  Most of the
->splice_reads are going to be direct calls to copy_splice_read() and
filemap_splice_read() or wrappers around filemap_splice_read().  The latter,
at least, might be in a good position to note EOF, perhaps by marking a pipe
buffer as "no more".

copy_splice_read() is more problematic because ->read_iter() doesn't indicate
if it hit EOF (or hit a hole).

David
---
ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
			 loff_t *ppos, size_t len, unsigned int flags)
{
	struct socket *sock = sock_from_file(out);
	struct bio_vec bvec[16];
	struct msghdr msg = {};
	ssize_t ret;
	size_t spliced = 0;
	bool need_wakeup = false;

	pipe_lock(pipe);

	while (len > 0) {
		unsigned int head, tail, mask, bc = 0;
		size_t remain = len;

		/*
		 * Check for signal early to make process killable when there
		 * are always buffers available
		 */
		ret = -ERESTARTSYS;
		if (signal_pending(current))
			break;

		while (pipe_empty(pipe->head, pipe->tail)) {
			ret = 0;
			if (!pipe->writers)
				goto out;

			if (spliced)
				goto out;

			ret = -EAGAIN;
			if (flags & SPLICE_F_NONBLOCK)
				goto out;

			ret = -ERESTARTSYS;
			if (signal_pending(current))
				goto out;

			if (need_wakeup) {
				wakeup_pipe_writers(pipe);
				need_wakeup = false;
			}

			pipe_wait_readable(pipe);
		}

		head = pipe->head;
		tail = pipe->tail;
		mask = pipe->ring_size - 1;

		while (!pipe_empty(head, tail)) {
			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
			size_t seg;

			if (!buf->len) {
				tail++;
				continue;
			}

			seg = min_t(size_t, remain, buf->len);
			seg = min_t(size_t, seg, PAGE_SIZE);

			ret = pipe_buf_confirm(pipe, buf);
			if (unlikely(ret)) {
				if (ret == -ENODATA)
					ret = 0;
				break;
			}

			bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset);
			remain -= seg;
			if (seg >= buf->len)
				tail++;
			if (bc >= ARRAY_SIZE(bvec))
				break;
		}

		if (!bc)
			break;

		msg.msg_flags = 0;
		if (flags & SPLICE_F_MORE)
			msg.msg_flags = MSG_MORE;
		if (remain && pipe_occupancy(pipe->head, tail) > 0)
			msg.msg_flags = MSG_MORE;
		msg.msg_flags |= MSG_SPLICE_PAGES;

		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc, len - remain);
		ret = sock_sendmsg(sock, &msg);
		if (ret <= 0)
			break;

		spliced += ret;
		len -= ret;
		tail = pipe->tail;
		while (ret > 0) {
			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
			size_t seg = min_t(size_t, ret, buf->len);

			buf->offset += seg;
			buf->len -= seg;
			ret -= seg;

			if (!buf->len) {
				pipe_buf_release(pipe, buf);
				tail++;
			}
		}

		if (tail != pipe->tail) {
			pipe->tail = tail;
			if (pipe->files)
				need_wakeup = true;
		}
	}

out:
	pipe_unlock(pipe);
	if (need_wakeup)
		wakeup_pipe_writers(pipe);
	return spliced ?: ret;
}