linux-kernel - Re: [PATCH v3] fs/splice: don't block splice_direct_to

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20240604104151.73n3zmn24hxmmwj6@quack3>
Date: Tue, 4 Jun 2024 12:41:51 +0200
From: Jan Kara <jack@...e.cz>
To: Max Kellermann <max.kellermann@...os.com>
Cc: axboe@...nel.dk, brauner@...nel.org, viro@...iv.linux.org.uk,
	hch@...radead.org, jack@...e.cz, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] fs/splice: don't block splice_direct_to_actor() after
 data was read

On Tue 04-06-24 11:24:31, Max Kellermann wrote:
> If userspace calls sendfile() with a very large "count" parameter, the
> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
> have been read from the hard disk and pushed into the socket buffer.
> 
> Usually, that is not a problem, because the socket write buffer gets
> filled quickly, and if the socket is non-blocking, the last
> direct_splice_actor() call will return -EAGAIN, causing
> splice_direct_to_actor() to break from the loop, and sendfile() will
> return a partial transfer.
> 
> However, if the network happens to be faster than the hard disk, and
> the socket buffer keeps getting drained between two
> generic_file_read_iter() calls, the sendfile() system call can keep
> running for a long time, blocking for disk I/O over and over.
> 
> That is undesirable, because it can block the calling process for too
> long.  I discovered a problem where nginx would block for so long that
> it would drop the HTTP connection because the kernel had just
> transferred 2 GiB in one call, and the HTTP socket was not writable
> (EPOLLOUT) for more than 60 seconds, resulting in a timeout:
> 
>   sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
>   sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
>   epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
>   gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
>   sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
>   writev(4, [], 0) = 0 <0.000439>
>   epoll_wait(9, [], 512, 60000) = 0 <60.060430>
>   gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
>   write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
>   close(12) = 0 <0.000063>
>   close(4)  = 0 <0.000091>
> 
> In newer nginx versions (since 1.21.4), this problem was worked around
> by defaulting "sendfile_max_chunk" to 2 MiB:
> 
>  https://github.com/nginx/nginx/commit/5636e7f7b4

Well, I can see your pain but after all the kernel does exactly what
userspace has asked for?

> Instead of asking userspace to provide an artificial upper limit, I'd
> like the kernel to block for disk I/O at most once, and then pass back
> control to userspace.
> 
> There is prior art for this kind of behavior in filemap_read():
> 
> 	/*
> 	 * If we've already successfully copied some data, then we
> 	 * can no longer safely return -EIOCBQUEUED. Hence mark
> 	 * an async read NOWAIT at that point.
> 	 */
> 	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
> 		iocb->ki_flags |= IOCB_NOWAIT;

Do note the IOCB_WAITQ test - this means that the request is coming from
io_uring and that has a retry logic to handle short reads. I'm really
nervous about changing sendfile behavior to unconditionally returning short
writes. In particular see this in do_sendfile():

#if 0
        /*
         * We need to debate whether we can enable this or not. The
         * man page documents EAGAIN return for the output at least,
         * and the application is arguably buggy if it doesn't expect
         * EAGAIN on a non-blocking file descriptor.
         */
        if (in.file->f_flags & O_NONBLOCK)
                fl = SPLICE_F_NONBLOCK;
#endif

We have been through these discussions in the past and so far we have
always decided the chances for userspace breakage are too big. After all
there's no substantial difference between userspace issuing a 2GB read(2) and
2GB sendfile(2). In both cases you are going to be blocked for a long
time and there are too many userspace applications that depend on this
behavior...

So I don't think we want to change the current behavior. We could create a
new interface to provide the semantics you want (like a flag to sendfile(2)
- which would mean a new syscall) but I'm wondering whether these days
io_uring isn't a better answer if you want to more closely control IO
behavior of your application and are willing to change used interface.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR