Message-ID: <c8001e7c-8039-3efb-948b-482b88005660@yandex-team.ru>
Date: Sun, 3 Nov 2019 15:02:05 +0300
From: Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To: Andy Lutomirski <luto@...nel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>
Cc: David Howells <dhowells@...hat.com>,
Rasmus Villemoes <linux@...musvillemoes.dk>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Peter Zijlstra <peterz@...radead.org>,
Nicolas Dichtel <nicolas.dichtel@...nd.com>, raven@...maw.net,
Christian Brauner <christian@...uner.io>,
keyrings@...r.kernel.org, USB list <linux-usb@...r.kernel.org>,
linux-block <linux-block@...r.kernel.org>,
LSM List <linux-security-module@...r.kernel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
Linux API <linux-api@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Miklos Szeredi <miklos@...redi.hu>
Subject: Re: [RFC PATCH 11/10] pipe: Add fsync() support [ver #2]
On 03/11/2019 02.14, Andy Lutomirski wrote:
> On Sat, Nov 2, 2019 at 4:10 PM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
>>
>> On Sat, Nov 2, 2019 at 4:02 PM Linus Torvalds
>> <torvalds@...ux-foundation.org> wrote:
>>>
>>> But I don't think anybody actually _did_ any of that. But that's
>>> basically the argument for the three splice operations:
>>> write/vmsplice/splice(). Which one you use depends on the lifetime and
>>> the source of your data. write() is obviously for the copy case (the
>>> source data might not be stable), while splice() is for the "data from
>>> another source", and vmsplice() is "data is from stable data in my
>>> vm".
>>
>> Btw, it's really worth noting that "splice()" and friends are from a
>> more happy-go-lucky time when we were experimenting with new
>> interfaces, and in a day and age when people thought that interfaces
>> like "sendpage()" and zero-copy and playing games with the VM was a
>> great thing to do.
>
> I suppose a nicer interface might be:
>
>
> madvise(buf, len, MADV_STABILIZE);
>
> (MADV_STABILIZE is an imaginary operation that write protects the
> memory a la fork() but without the copying part.)
>
> vmsplice_safer(fd, ...);
>
> Where vmsplice_safer() is like vmsplice, except that it only works on
> write-protected pages. If you vmsplice_safer() some memory and then
> write to the memory, the pipe keeps the old copy.
>
> But this can all be done with memfd and splice, too, I think.
Looks monstrous. This will kill all fun and profit. =)

I think vmsplice should at least deprecate and ignore SPLICE_F_GIFT.
It almost never works: if the page is still mapped, then page_count in
generic_pipe_buf_steal() will be at least 2 (pte and pipe gup).
But if the user munmaps the vma between splicing and consuming (and the
page is not stuck in lazy tlb or per-cpu vectors), then a page from the
anon lru could be spliced into a file. Ouch.

And it looks like the fuse device still accepts SPLICE_F_MOVE.
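
For comparison, the "memfd and splice" route mentioned above could look
roughly like the sketch below. This is only an illustration under my own
assumptions: the helper names make_stable_memfd() and
splice_memfd_to_pipe() are made up, the data is copied once into the
memfd, and sealing is used so that nobody can change the pages after
they are handed to the pipe.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy len bytes into a fresh memfd and seal it
 * against writes and resizing. Returns the fd, or -1 on error.
 */
int make_stable_memfd(const void *data, size_t len)
{
	const char *p = data;
	size_t left = len;
	int fd;

	fd = memfd_create("stable-buf", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	while (left > 0) {
		ssize_t n = write(fd, p, left);

		if (n <= 0)
			goto fail;
		p += n;
		left -= (size_t)n;
	}

	/* After this, write() and ftruncate() on the memfd fail. */
	if (fcntl(fd, F_ADD_SEALS,
		  F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_GROW) < 0)
		goto fail;

	return fd;
fail:
	close(fd);
	return -1;
}

/*
 * Hypothetical helper: splice the first len bytes of the memfd into the
 * write end of a pipe. Returns bytes moved, or -1 on error. It blocks
 * if the pipe is full, like any other pipe writer.
 */
ssize_t splice_memfd_to_pipe(int memfd, int pipe_wr, size_t len)
{
	loff_t off = 0;
	size_t done = 0;

	while (done < len) {
		ssize_t n = splice(memfd, &off, pipe_wr, NULL, len - done, 0);

		if (n < 0)
			return -1;
		if (n == 0)
			break;	/* hit end of file */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

The reader on the other end of the pipe sees the sealed contents no
matter what the producer does with its own buffer afterwards, at the
cost of one copy into the memfd.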
>
>
>>
>> It turns out that VM games are almost always more expensive than just
>> copying the data in the first place, but hey, people didn't know that,
>> and zero-copy was seen as a big deal.
>>
>> The reality is that almost nobody uses splice and vmsplice at all, and
>> they have been a much bigger headache than they are worth. If I could
>> go back in time and not do them, I would. But there have been a few
>> very special uses that seem to actually like the interfaces.
>>
>> But it's entirely possible that we should kill vmsplice() (likely by
>> just implementing the semantics as "write()") because it's not common
>> enough to have the complexity.
>
> I think this is the right choice.
>
> FWIW, the openssl vmsplice() call looks dubious, but I suspect it's
> okay because it's vmsplicing to a netlink socket, and the kernel code
> on the other end won't read the data after it returns a response.
>
> --Andy
>