linux-kernel - Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ltbgocygx4unco6ssoiszwsgjmztyuxkqja3omvvyqvpii6dac@5abamn33galn>
Date:   Sat, 8 Jul 2023 02:30:16 +0200
From:   Ahelenia Ziemiańska 
        <nabijaczleweli@...ijaczleweli.xyz>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Christian Brauner <brauner@...nel.org>,
        Jens Axboe <axboe@...nel.dk>,
        David Howells <dhowells@...hat.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: Pending splice(file -> FIFO) excludes all other FIFO operations
 forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read
 side?)

On Fri, Jul 07, 2023 at 03:57:40PM -0700, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 15:41, Ahelenia Ziemiańska
> <nabijaczleweli@...ijaczleweli.xyz> wrote:
> > (inlined by) eat_empty_buffer at fs/splice.c:594
> Ahh, eat_empty_buffer() ends up releasing the buffer without waiting for it.
> 
> And the reason for that is actually somewhat interesting: we do have that
> 
>         while (!pipe_readable(pipe)) {
>              ..
> 
> above it, but the logic for this all is that pipes with pipe buffers
> are by *default* considered readable until they try to actually
> confirm the buffer, and at that point they might say "oh, I have to
> return -EAGAIN and set 'not_ready'".
> 
> And that splice_from_pipe_next() doesn't do that.
> 
> End result: it will happily free that pipe buffer that is still in the
> process of being filled.
> 
> The good news is that I think the fix is probably trivial. Something
> like the attached?
> 
> Again - NOT TESTED
Same reproducer, backtrace attached:
$ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e
splice_from_pipe_next+0x6e/0x180:
pipe_buf_confirm at include/linux/pipe_fs_i.h:233
(inlined by) eat_empty_buffer at fs/splice.c:597
(inlined by) splice_from_pipe_next at fs/splice.c:647

Looks like the same place.

> > Besides that, this doesn't solve the original issue, inasmuch as
> >   ./v > fifo &
> >   head fifo &
> >   echo zupa > fifo
> > (where ./v splices from an empty pty to stdout; v.c attached)
> > echo still sleeps until ./v dies, though it also succumbs to ^C now.
> Yeah, I concentrated on just making everything interruptible,
> 
> But the fact that the echo has to wait for the previous write to
> finish is kind of fundamental. We can't just magically do writes out
> of order. 'v' is busy writing to the fifo, we can't let some other
> write just come in.
(It's no longer busy writing to it when it gets killed by timeout in my
 second example, but echo still doesn't wake up.)

> (We *could* make the splice in ./v not fill the whole pipe buffer, and
> allow some other writes to fill in buffers afterwards, but at _some_
> point you'll hit the "pipe buffers are full and busy, can't add any
> more without waiting for them to empty").
You are, but, well, that's also the case when the pipe is full.
As it stands, the pipe is /empty/ and yet /no-one can write to it/.
This is the crux of the issue at hand.

(Coincidentally, you're describing what looks quite similar to pt 1.
 from <naljsvzzemr6pjiwwcdpdsdh5vtycdr6fmi2fk2dlr4nn4kq6o@...nbgxhslti>.)

I think we got away with it for so long because most files behave
like regular files/blockdevs and the read is always guaranteed to
complete ~instantly, but splice is, fundamentally, /not/ writing
the whole time because it's not /reading/ the whole time when it's
reading an empty socket/a chardev/whatever.

Or, rather: splice() from a non-seekable (non-mmap()pable?)
fundamentally doesn't really make much sense, because you're not
gluing a bit of the pro-verbial page cache (forgive me my term
abuse here, you see what I'm getting at tho) to the end of the pipe,
you're just telling the kernel to run a read()/write() loop for you.

sendfile() works around this by reading and then separately writing
to its special buffer (in the form of an anonymous process-local pipe).
splice() just raw-dogs the read with the write lock held,
but just doesn't check if it can do it.

That's how it's, honestly, shaking out to me: someone just forgot
a check the first time they wrote it.
Because the precondition for "does reading directly into the pipe
buffer make sense" is "is it directly equivalent to read(f)/write(p)",
and that holds only for seekables.

/Maybe/ for, like, sockets if there's already data, or as a special
case similar to pipe->pipe. But then for sockets you're already
using sendfile(), so?

To that end, I'm including a patch based on
  4f6b6c2b2f86b7878a770736bf478d8a263ff0bc
that does just that: EINVAL.

Maybe if you're worried about breaking compatibility
(which idk if it's an issue since splice and sendfile
 are temperamental w.r.t. what they take anyway across versions;
 you need a read()/write() fallback anyway)
that case could become sendfile-like by first reading into a buffer
once then pipe->pipe splicing out of it separately.
Though, even the usual sendfile read/write loop only works on seekables.

> One thing we could possibly do is to say that we just don't accept any
> new writes if there are old busy splices in process. So we could make
> new writes return -EBUSY or something, I guess.
Same logic as above applies + no-one's really expecting EBUSY +
what /would/ you do about EBUSY in this case?
-- >8 --
From 552d93bee8e1b8333ce84ed0ca8490f6712c5b8b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ahelenia=20Ziemia=C5=84ska?=
 <nabijaczleweli@...ijaczleweli.xyz>
Date: Sat, 8 Jul 2023 01:47:59 +0200
Subject: [PATCH] splice: file->pipe: -EINVAL if file non-seekable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Both the logical semantic of "tie a page from the page cache to the pipe"
and the implemented semantic of "lock the pipe, then read into it"
(thus holding the write lock for as as long as the read is outstanding)
only make sense if the read is guaranteed to complete instantly.

This has been a long-standing omission and, when the read doesn't
immediately complete (because it's a tty, socket, &c.), the write lock
‒ which for pipes is a pipe-global mutex ‒ is held until,
thus excluding all mutual users of the pipe, until it does.

Refuse it. Use read()/write() in userspace instead of getting the kernel
to do it for you, badly, when there's no point to doing so.

Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjEC_Rh8+-rtEi8C45upO-Ffw=M_i1211qS_3AvWZCbOg@mail.gmail.com/t/#u
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@...ijaczleweli.xyz>
---
 fs/splice.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 004eb1c4ce31..14cf3cea1091 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1309,6 +1309,8 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
 	if (opipe) {
 		if (off_out)
 			return -ESPIPE;
+		if (!(in->f_mode & FMODE_LSEEK))
+			return -EINVAL;
 		if (off_in) {
 			if (!(in->f_mode & FMODE_PREAD))
 				return -EINVAL;
-- 
2.39.2


View attachment "BUG" of type "text/plain" (3993 bytes)

Download attachment "signature.asc" of type "application/pgp-signature" (834 bytes)