lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <0cfd9f02-dea7-90e2-e932-c8129b6013c7@samba.org>
Date:   Thu, 9 Feb 2023 14:55:59 +0100
From:   Stefan Metzmacher <metze@...ba.org>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Linux API Mailing List <linux-api@...r.kernel.org>,
        io-uring <io-uring@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        Samba Technical <samba-technical@...ts.samba.org>
Subject: copy on write for splice() from file to pipe?

Hi Linus and others,

as written in a private mail before, I'm currently trying to
make use of IORING_OP_SPLICE in order to get zero copy support
in Samba.

The most important use cases are 8 Mbytes reads and writes to
files. where "memcpy" (at the lowest end copy_user_enhanced_fast_string())
is the obvious performance killer.

I have a prototype that offers really great performance
avoiding "memcpy" by using splice() (in order to get async IORING_OP_SPLICE).

So we have two cases:

1. network -> socket -> splice -> pipe -> splice -> file -> storage

2. storage -> file -> splice -> pipe -> splice -> socket -> network

With 1. I guess everything can work reliable, once
the pages are created/filled in the socket receive buffer
they are used exclusively and they won't be shared on
the way to the file. Which means we can be sure that
data arrives unmodified in the file(system).

But with 2. there's a problem, as the pages from the file,
which are spliced into the pipe are still shared without
copy on write with the file(system). It means writes to the file
after the first splice modify the content of the spliced pages!
So the content may change uncontrolled before it reaches the network!
I created a little example that demonstrates the problem (see below),
it gives the following output:

> open(O_TMPFILE) => ffd[3]
> pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0x1f sret[4096]
> pipe() => ret[0]
> splice(count=PIPE_BUF*2,ofs=0) sret[8192]
> pwrite(count=PIPE_BUF,ofs=0) 0xf0 sret[4096]
> pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0xf0 sret[4096]
> read(from_pipe, count=PIPE_BUF) sret[4096]
> memcmp() at ofs=0, expecting 0x00 => ret[240]
> memcmp() at ofs=0, checking for 0xf0 => ret[0]
> read(from_pipe, count=PIPE_BUF) sret[4096]
> memcmp() at ofs=PIPE_BUF, expecting 0x1f => ret[209]
> memcmp() at ofs=PIPE_BUF, checking for 0xf0 => ret[0]

After reading from the pipe we get the values we have written to
the file instead of the values we had at the time of splice.

For things like web servers, which mostly serve static content, this
isn't a problem, but it is for Samba, when reads and writes may happen within
microseconds, before the content is pushed to the network.

I'm wondering if there's a possible way out of this, maybe triggered by a new
flag passed to splice.

I looked through the code and noticed the existence of IOMAP_F_SHARED.
Maybe the splice from the page cache to the pipe could set IOMAP_F_SHARED,
while incrementing the refcount and the network driver could remove it again
when the refcount reaches 1 again.

Is there any other way we could archive something like this?

In addition and/or as alternative I was thinking about a flag to
preadv2() (and IORING_OP_READV) to indicate the use of something
like async_memcpy(), instead of doing the copy via the cpu.
That in combination with IORING_OP_SENDMSG_ZC would avoid "memcpy"
on the cpu.

Any hints, remarks and prototype patches are highly welcome.

Thanks!
metze

#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <limits.h>

int main(void)
{
	int ffd;
	int pfds[2];
	char buf [PIPE_BUF] = {0, };
	char buf2 [PIPE_BUF] = {0, };
	ssize_t sret;
	int ret;
	off_t ofs;

	memset(buf, 0x1f, PIPE_BUF);

	ffd = open("/tmp/", O_RDWR | O_TMPFILE, S_IRUSR | S_IWUSR);
	printf("open(O_TMPFILE) => ffd[%d]\n", ffd);

	sret = pwrite(ffd, buf, PIPE_BUF, PIPE_BUF);
	printf("pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0x1f sret[%zd]\n", sret);

	ret = pipe(pfds);
	printf("pipe() => ret[%d]\n", ret);

	ofs = 0;
	sret = splice(ffd, &ofs, pfds[1], NULL, PIPE_BUF*2, 0);
	printf("splice(count=PIPE_BUF*2,ofs=0) sret[%zd]\n", sret);

	memset(buf, 0xf0, PIPE_BUF);

	sret = pwrite(ffd, buf, PIPE_BUF, 0);
	printf("pwrite(count=PIPE_BUF,ofs=0) 0xf0 sret[%zd]\n", sret);
	sret = pwrite(ffd, buf, PIPE_BUF, PIPE_BUF);
	printf("pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0xf0 sret[%zd]\n", sret);

	sret = read(pfds[0], buf, PIPE_BUF);
	printf("read(from_pipe, count=PIPE_BUF) sret[%zd]\n", sret);

	memset(buf2, 0x00, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=0, expecting 0x00 => ret[%d]\n", ret);
	memset(buf2, 0xf0, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=0, checking for 0xf0 => ret[%d]\n", ret);

	sret = read(pfds[0], buf, PIPE_BUF);
	printf("read(from_pipe, count=PIPE_BUF) sret[%zd]\n", sret);

	memset(buf2, 0x1f, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=PIPE_BUF, expecting 0x1f => ret[%d]\n", ret);
	memset(buf2, 0xf0, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=PIPE_BUF, checking for 0xf0 => ret[%d]\n", ret);
	return 0;
}

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ