[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8698ba1f-fc5d-a82e-842b-100dc8957f2f@redhat.com>
Date: Fri, 8 Sep 2023 18:48:05 +0200
From: David Hildenbrand <david@...hat.com>
To: Christoph Hellwig <hch@....de>
Cc: Jan Kara <jack@...e.cz>, David Howells <dhowells@...hat.com>,
Peter Xu <peterx@...hat.com>, Lei Huang <lei.huang@...ux.intel.com>,
miklos@...redi.hu, Xiubo Li <xiubli@...hat.com>,
Ilya Dryomov <idryomov@...il.com>, Jeff Layton <jlayton@...nel.org>,
Trond Myklebust <trond.myklebust@...merspace.com>,
Anna Schumaker <anna@...nel.org>, Latchesar Ionkov <lucho@...kov.net>,
Dominique Martinet <asmadeus@...ewreck.org>,
Christian Schoenebeck <linux_oss@...debyte.com>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
John Fastabend <john.fastabend@...il.com>,
Jakub Sitnicki <jakub@...udflare.com>, Boris Pismenny <borisp@...dia.com>,
linux-nfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
ceph-devel@...r.kernel.org, linux-mm@...ck.org, v9fs@...ts.linux.dev,
netdev@...r.kernel.org
Subject: Re: getting rid of the last memory modifitions through gup(FOLL_GET)
On 08.09.23 10:15, Christoph Hellwig wrote:
> On Wed, Sep 06, 2023 at 11:42:33AM +0200, David Hildenbrand wrote:
>>> and iov_iter_get_pages_alloc2. We have three file system direct I/O
>>> users of those left: ceph, fuse and nfs. Lei Huang has sent patches
>>> to convert fuse to iov_iter_extract_pages which I'd love to see merged,
>>> and we'd need equivalent work for ceph and nfs.
>>>
>>> The non-file system uses are in the vmsplice code, which only reads
>>
>> vmsplice really has to be fixed to specify FOLL_PIN|FOLL_LONGTERM for good;
>> I recall that David Howells had patches for that at one point. (at least to
>> use FOLL_PIN)
>
> Hmm, unless I'm misreading the code vmsplace is only using
> iov_iter_get_pages2 for reading from the user address space anyway.
> Or am I missing something?
It's not relevant for the case you're describing here ("last memory
modifitions through gup(FOLL_GET)").
vmsplice_to_pipe() -> iter_to_pipe() -> iov_iter_get_pages2()
So it ends up calling get_user_pages_fast()
... and not using FOLL_PIN|FOLL_LONGTERM
Why FOLL_LONGTERM? Because it's a longterm pin, where unprivileged users
can grab a reference on a page for all eternity, breaking CMA and memory
hotunplug (well, and harming compaction).
Why FOLL_PIN? Well FOLL_LONGTERM only applies to FOLL_PIN. But for
anonymous memory, this will also take care of the last remaining hugetlb
COW test (trigger COW unsharing) as commented back in:
https://lore.kernel.org/all/02063032-61e7-e1e5-cd51-a50337405159@redhat.com/
>
>>> After that we might have to do an audit of the raw get_user_pages APIs,
>>> but there probably aren't many that modify file backed memory.
>>
>> ptrace should apply that ends up doing a FOLL_GET|FOLL_WRITE.
>
> Yes, if that ends up on file backed shared mappings we also need a pin.
See below.
>
>> Further, KVM ends up using FOLL_GET|FOLL_WRITE to populate the second-level
>> page tables for VMs, and uses MMU notifiers to synchronize the second-level
>> page tables with process page table changes. So once a PTE goes from
>> writable -> r/o in the process page table, the second level page tables for
>> the VM will get updated. Such MMU users are quite different from ordinary
>> GUP users.
>
> Can KVM page tables use file backed shared mappings?
Yes, usually shmem and hugetlb. But with things like emulated
NVDIMMs/virtio-pmem for VMs, easily also ordinary files.
But it's really not ordinary write access through GUP. It's write access
via a secondary page table (secondary MMU), that's synchronized to the
process page table -- just like if the CPU would be writing to the page
using the process page tables (primary MMU).
>
>> Converting ptrace might not be desired/required as well (the reference is
>> dropped immediately after the read/write access).
>
> But the pin is needed to make sure the file system can account for
> dirtying the pages. Something we fundamentally can't do with get.
ptrace will find the pagecache page writable in the page table (PTE
write bit set), if it intends to write to the page (FOLL_WRITE). If it
is not writable, it will trigger a page fault that informs the file system.
With an FS that wants writenotify, we will not map a page writable (PTE
write bit not set) unless it is dirty (PTE dirty bit set) IIRC.
So are we concerned about a race between the filesystem removing the PTE
write bit (to catch next write access before it gets dirtied again) and
ptrace marking the page dirty?
It's a very, very small race window, staring at __access_remote_vm().
But it should apply if that's the concern.
>
>> The end goal as discussed a couple of times would be the to limit FOLL_GET
>> in general only to a couple of users that can be audited and keep using it
>> for a good reason. Arbitrary drivers that perform DMA should stop using it
>> (and ideally be prevented from using it) and switch to FOLL_PIN.
>
> Agreed, that's where I'd like to get to. Preferably with the non-pin
> API not even beeing epxorted to modules.
Yes. However, secondary MMU users (like KVM) would need some way to keep
making use of that; ideally, using a proper separate interface instead
of (ab)using plain GUP and confusing people :)
[1] https://lkml.org/lkml/2023/1/24/451
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists