linux-kernel - Re: [RFC PATCH 11/10] pipe: Add fsync() support [ver #2]


Message-ID: <c8001e7c-8039-3efb-948b-482b88005660@yandex-team.ru>
Date:   Sun, 3 Nov 2019 15:02:05 +0300
From:   Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To:     Andy Lutomirski <luto@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     David Howells <dhowells@...hat.com>,
        Rasmus Villemoes <linux@...musvillemoes.dk>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Nicolas Dichtel <nicolas.dichtel@...nd.com>, raven@...maw.net,
        Christian Brauner <christian@...uner.io>,
        keyrings@...r.kernel.org, USB list <linux-usb@...r.kernel.org>,
        linux-block <linux-block@...r.kernel.org>,
        LSM List <linux-security-module@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Linux API <linux-api@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Miklos Szeredi <miklos@...redi.hu>
Subject: Re: [RFC PATCH 11/10] pipe: Add fsync() support [ver #2]

On 03/11/2019 02.14, Andy Lutomirski wrote:
> On Sat, Nov 2, 2019 at 4:10 PM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
>>
>> On Sat, Nov 2, 2019 at 4:02 PM Linus Torvalds
>> <torvalds@...ux-foundation.org> wrote:
>>>
>>> But I don't think anybody actually _did_ any of that. But that's
>>> basically the argument for the three splice operations:
>>> write/vmsplice/splice(). Which one you use depends on the lifetime and
>>> the source of your data. write() is obviously for the copy case (the
>>> source data might not be stable), while splice() is for the "data from
>>> another source", and vmsplice() is "data is from stable data in my
>>> vm".
>>
>> Btw, it's really worth noting that "splice()" and friends are from a
>> more happy-go-lucky time when we were experimenting with new
>> interfaces, and in a day and age when people thought that interfaces
>> like "sendpage()" and zero-copy and playing games with the VM was a
>> great thing to do.
> 
> I suppose a nicer interface might be:
> 
> 
> madvise(buf, len, MADV_STABILIZE);
> 
> (MADV_STABILIZE is an imaginary operation that write protects the
> memory a la fork() but without the copying part.)
> 
> vmsplice_safer(fd, ...);
> 
> Where vmsplice_safer() is like vmsplice, except that it only works on
> write-protected pages.  If you vmsplice_safer() some memory and then
> write to the memory, the pipe keeps the old copy.
> 
> But this can all be done with memfd and splice, too, I think.

Looks monstrous. This will kill all fun and profit. =)

I think vmsplice should at least deprecate and ignore SPLICE_F_GIFT.

It almost never works: if the page is still mapped, then page_count in
generic_pipe_buf_steal() will be at least 2 (one reference from the pte,
one from the pipe's gup), so the steal fails. But if the user munmaps the
vma between splicing and consuming (and the page is not stuck in lazy tlb
or per-cpu vectors), then a page from the anon lru could be spliced into
a file. Ouch.

And it looks like the fuse device still accepts SPLICE_F_MOVE.
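
For reference, here is a minimal sketch of what a plain vmsplice() call
without SPLICE_F_GIFT looks like from userspace, assuming Linux with
glibc. The helper name vmsplice_all() is made up for illustration and is
not part of any patch in this thread. The pipe ends up referencing the
caller's pages, so the caller must not modify the buffer until the reader
has consumed it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes at `buf` into the pipe with
 * vmsplice(), retrying on short transfers. No SPLICE_F_GIFT is passed,
 * so the pages stay owned by the caller and must remain unmodified
 * until the reader has drained them.
 * Returns the number of bytes queued, or -1 on error.
 */
static ssize_t vmsplice_all(int pipe_wr, const void *buf, size_t len)
{
    assert(buf != NULL);

    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    size_t done = 0;

    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(pipe_wr, &iov, 1, 0);
        if (n <= 0)
            return -1;
        /* Advance past the part the kernel already accepted. */
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Called as vmsplice_all(pipefd[1], msg, msg_len) on the write end of a
pipe; the call blocks if the pipe is full and nobody is reading.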

> 
> 
>>
>> It turns out that VM games are almost always more expensive than just
>> copying the data in the first place, but hey, people didn't know that,
>> and zero-copy was seen as a big deal.
>>
>> The reality is that almost nobody uses splice and vmsplice at all, and
>> they have been a much bigger headache than they are worth. If I could
>> go back in time and not do them, I would. But there have been a few
>> very special uses that seem to actually like the interfaces.
>>
>> But it's entirely possible that we should kill vmsplice() (likely by
>> just implementing the semantics as "write()") because it's not common
>> enough to have the complexity.
> 
> I think this is the right choice.
> 
> FWIW, the openssl vmsplice() call looks dubious, but I suspect it's
> okay because it's vmsplicing to a netlink socket, and the kernel code
> on the other end won't read the data after it returns a response.
> 
> --Andy
>
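
For reference, one possible reading of the "memfd and splice" alternative
mentioned earlier in the thread, as a rough sketch only. It assumes Linux
with glibc 2.27 or later; the helper name send_via_memfd() and the choice
of seals are my assumptions, not something anyone in the thread proposed.
The write() into the memfd is only there to keep the example short; a
real producer would presumably build the data in the memfd directly.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: put `len` bytes into a sealed memfd and splice
 * them into the pipe. Once sealed, the memfd contents cannot change, so
 * the pipe sees stable data regardless of what happens to `buf` later.
 * Returns the number of bytes spliced, or -1 on error.
 */
static ssize_t send_via_memfd(int pipe_wr, const void *buf, size_t len)
{
    assert(buf != NULL);

    ssize_t total = -1;
    size_t done = 0;
    loff_t off = 0;

    int mfd = memfd_create("stable-buf", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (mfd < 0)
        return -1;

    /* Fill the memfd (a copy, for brevity of the sketch). */
    if (write(mfd, buf, len) != (ssize_t)len)
        goto out;

    /* Assumption: seal so the contents stay stable from here on. */
    if (fcntl(mfd, F_ADD_SEALS,
              F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_GROW) < 0)
        goto out;

    /* Move the memfd pages into the pipe, retrying on short splices. */
    while (done < len) {
        ssize_t n = splice(mfd, &off, pipe_wr, NULL, len - done, 0);
        if (n <= 0)
            goto out;
        done += (size_t)n;
    }
    total = (ssize_t)done;
out:
    close(mfd);
    return total;
}
```

Called as send_via_memfd(pipefd[1], msg, msg_len); like any pipe write it
blocks if the data does not fit and nobody is reading.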