lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8698ba1f-fc5d-a82e-842b-100dc8957f2f@redhat.com>
Date: Fri, 8 Sep 2023 18:48:05 +0200
From: David Hildenbrand <david@...hat.com>
To: Christoph Hellwig <hch@....de>
Cc: Jan Kara <jack@...e.cz>, David Howells <dhowells@...hat.com>,
 Peter Xu <peterx@...hat.com>, Lei Huang <lei.huang@...ux.intel.com>,
 miklos@...redi.hu, Xiubo Li <xiubli@...hat.com>,
 Ilya Dryomov <idryomov@...il.com>, Jeff Layton <jlayton@...nel.org>,
 Trond Myklebust <trond.myklebust@...merspace.com>,
 Anna Schumaker <anna@...nel.org>, Latchesar Ionkov <lucho@...kov.net>,
 Dominique Martinet <asmadeus@...ewreck.org>,
 Christian Schoenebeck <linux_oss@...debyte.com>,
 "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 John Fastabend <john.fastabend@...il.com>,
 Jakub Sitnicki <jakub@...udflare.com>, Boris Pismenny <borisp@...dia.com>,
 linux-nfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 ceph-devel@...r.kernel.org, linux-mm@...ck.org, v9fs@...ts.linux.dev,
 netdev@...r.kernel.org
Subject: Re: getting rid of the last memory modifitions through gup(FOLL_GET)

On 08.09.23 10:15, Christoph Hellwig wrote:
> On Wed, Sep 06, 2023 at 11:42:33AM +0200, David Hildenbrand wrote:
>>> and iov_iter_get_pages_alloc2.  We have three file system direct I/O
>>> users of those left: ceph, fuse and nfs.  Lei Huang has sent patches
>>> to convert fuse to iov_iter_extract_pages which I'd love to see merged,
>>> and we'd need equivalent work for ceph and nfs.
>>>
>>> The non-file system uses are in the vmsplice code, which only reads
>>
>> vmsplice really has to be fixed to specify FOLL_PIN|FOLL_LONGTERM for good;
>> I recall that David Howells had patches for that at one point. (at least to
>> use FOLL_PIN)
> 
> Hmm, unless I'm misreading the code vmsplace is only using
> iov_iter_get_pages2 for reading from the user address space anyway.
> Or am I missing something?

It's not relevant for the case you're describing here ("last memory 
modifitions through gup(FOLL_GET)").

vmsplice_to_pipe() -> iter_to_pipe() -> iov_iter_get_pages2()

So it ends up calling get_user_pages_fast()

... and not using FOLL_PIN|FOLL_LONGTERM

Why FOLL_LONGTERM? Because it's a longterm pin, where unprivileged users 
can grab a reference on a page for all eternity, breaking CMA and memory 
hotunplug (well, and harming compaction).

Why FOLL_PIN? Well FOLL_LONGTERM only applies to FOLL_PIN. But for 
anonymous memory, this will also take care of the last remaining hugetlb 
COW test (trigger COW unsharing) as commented back in:

https://lore.kernel.org/all/02063032-61e7-e1e5-cd51-a50337405159@redhat.com/


> 
>>> After that we might have to do an audit of the raw get_user_pages APIs,
>>> but there probably aren't many that modify file backed memory.
>>
>> ptrace should apply that ends up doing a FOLL_GET|FOLL_WRITE.
> 
> Yes, if that ends up on file backed shared mappings we also need a pin.

See below.

> 
>> Further, KVM ends up using FOLL_GET|FOLL_WRITE to populate the second-level
>> page tables for VMs, and uses MMU notifiers to synchronize the second-level
>> page tables with process page table changes. So once a PTE goes from
>> writable -> r/o in the process page table, the second level page tables for
>> the VM will get updated. Such MMU users are quite different from ordinary
>> GUP users.
> 
> Can KVM page tables use file backed shared mappings?

Yes, usually shmem and hugetlb. But with things like emulated 
NVDIMMs/virtio-pmem for VMs, easily also ordinary files.

But it's really not ordinary write access through GUP. It's write access 
via a secondary page table (secondary MMU), that's synchronized to the 
process page table -- just like if the CPU would be writing to the page 
using the process page tables (primary MMU).

> 
>> Converting ptrace might not be desired/required as well (the reference is
>> dropped immediately after the read/write access).
> 
> But the pin is needed to make sure the file system can account for
> dirtying the pages.  Something we fundamentally can't do with get.

ptrace will find the pagecache page writable in the page table (PTE 
write bit set), if it intends to write to the page (FOLL_WRITE). If it 
is not writable, it will trigger a page fault that informs the file system.

With an FS that wants writenotify, we will not map a page writable (PTE 
write bit not set) unless it is dirty (PTE dirty bit set) IIRC.

So are we concerned about a race between the filesystem removing the PTE 
write bit (to catch next write access before it gets dirtied again) and 
ptrace marking the page dirty?

It's a very, very small race window, staring at __access_remote_vm(). 
But it should apply if that's the concern.

> 
>> The end goal as discussed a couple of times would be the to limit FOLL_GET
>> in general only to a couple of users that can be audited and keep using it
>> for a good reason. Arbitrary drivers that perform DMA should stop using it
>> (and ideally be prevented from using it) and switch to FOLL_PIN.
> 
> Agreed, that's where I'd like to get to.  Preferably with the non-pin
> API not even beeing epxorted to modules.

Yes. However, secondary MMU users (like KVM) would need some way to keep 
making use of that; ideally, using a proper separate interface instead 
of (ab)using plain GUP and confusing people :)

[1] https://lkml.org/lkml/2023/1/24/451

-- 
Cheers,

David / dhildenb


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ