linux-kernel - Re: [PATCH v6 1/2] rust: mm: add abstractions for mm_struct and vm_area

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAH5fLgiVTK6mv5m4FaNFZMP4vHKo=8UbR9qm56aaB0Dmrp3x3w@mail.gmail.com>
Date: Tue, 12 Nov 2024 13:12:00 +0100
From: Alice Ryhl <aliceryhl@...gle.com>
To: Jann Horn <jannh@...gle.com>
Cc: Miguel Ojeda <ojeda@...nel.org>, Matthew Wilcox <willy@...radead.org>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Vlastimil Babka <vbabka@...e.cz>, 
	John Hubbard <jhubbard@...dia.com>, "Liam R. Howlett" <Liam.Howlett@...cle.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Greg Kroah-Hartman <gregkh@...uxfoundation.org>, 
	Arnd Bergmann <arnd@...db.de>, Alex Gaynor <alex.gaynor@...il.com>, Boqun Feng <boqun.feng@...il.com>, 
	Gary Guo <gary@...yguo.net>, Björn Roy Baron <bjorn3_gh@...tonmail.com>, 
	Benno Lossin <benno.lossin@...ton.me>, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	rust-for-linux@...r.kernel.org, Andreas Hindborg <a.hindborg@...nel.org>, 
	Wedson Almeida Filho <wedsonaf@...il.com>
Subject: Re: [PATCH v6 1/2] rust: mm: add abstractions for mm_struct and vm_area_struct

On Mon, Nov 11, 2024 at 5:51 PM Jann Horn <jannh@...gle.com> wrote:
>
> On Mon, Nov 11, 2024 at 12:41 PM Alice Ryhl <aliceryhl@...gle.com> wrote:
> > Thanks a lot for your review!
> >
> > Note that there is a v7 already:
> > https://lore.kernel.org/all/20241014-vma-v7-0-01e32f861195@google.com/
>
> Yeah, oops, somehow I only realized I was looking at an older version
> of the series after sending my mail...
>
> > On Fri, Nov 8, 2024 at 8:02 PM Jann Horn <jannh@...gle.com> wrote:
> > > On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@...gle.com> wrote:
> > >> These abstractions allow you to manipulate vmas. Rust Binder will uses
> > >> these in a few different ways.
> > >>
> > >> In the mmap implementation, a VmAreaNew will be provided to the mmap
> > >> call which allows it to modify the vma in ways that are only okay during
> > >> initial setup. This is the case where the most methods are available.
> > >>
> > >> However, Rust Binder needs to insert and remove pages from the vma as
> > >> time passes. When incoming messages arrive, pages may need to be
> > >> inserted if space is missing, and in this case that is done by using a
> > >> stashed ARef<Mm> and calling mmget_not_zero followed by mmap_write_lock
> > >> followed by vma_lookup followed by vm_insert_page. In this case, since
> > >> mmap_write_lock is used, the VmAreaMut type will be in use.
> > >
> > > FYI, the way the C binder implementation uses vma_lookup() and
> > > vm_insert_page() is not very idiomatic. The standard way of
> > > implementing binder_alloc_free_page() would be to use something like
> > > unmap_mapping_range() instead of using
> > > vma_lookup()+zap_page_range_single(); though to set that up, you'd
> > > have to create one inode per binder file, maybe with something like
> > > the DRM subsystem's drm_fs_inode_new(). And instead of having
> > > binder_install_single_page(), the standard way would be to let
> > > binder_vm_fault() install PTEs lazily on fault. That way you'd never
> > > have to take mmap locks or grab MM references yourself.
> >
> > Would that actually work? All writes happen from kernel space, so it's
> > not clear to me that we can trigger the fault handler from there. We
> > can't use copy_to_user because the write happens from a different
> > process than the one that owns the vma.
> >
> > That said, one alternative is to let Binder store an array of `struct
> > page` and just write directly to those pages when writing from kernel
> > space. Then binder_vm_fault() can look up the relevant page from the
> > array when userspace attempts to read the data.
>
> Right, that's what I was thinking of - keep allocating pages at the
> same point in time when binder currently does it, only defer mapping
> them into userspace.
>
> > Though we could also
> > just vm_insert_page() right before returning the address to userspace,
> > since we know that userspace will read it right away after the syscall
> > returns.
>
> I think that is basically what binder does now?

Right now binder calls vm_insert_page() right after calling
alloc_page(), which means that vm_insert_page() happens in the process
sending the message. But we could delay the call so that it happens in
the process that receives the message instead (which is also the
process that owns the mapping).

> > > I guess the one benefit of the current implementation of C binder over
> > > the more idiomatic approach is that it avoids taking a fault after
> > > binder_install_buffer_pages(), but that's probably not such a hot path
> > > that it matters...
> >
> > Saying "probably not a hot path" is dangerous when it comes to binder's
> > allocator. :)
>
> Each page installed by binder_install_buffer_pages() has to be torn
> down by either binder's shrinker callback or by userspace unmapping
> binder VMAs, right? So for this to be a hot path, the system would
> have to either constantly create and destroy new binder FDs, or
> execute the shrinker at a fairly high frequency while lots of binder
> clients perform infrequent binder transactions?

I think the situation where the phone is under major memory pressure
and runs the shrinker all the time can be rather subtle.

> > > Let me know if you think it would be helpful to see a prototype of
> > > that in C - I think I can cobble that together, but doing it nicely
> > > will require some work to convert at least some of the binder_alloc
> > > locking to mutexes, and some more work to switch the ->f_mapping of
> > > the binder file or something like that. (I guess modeling that in Rust
> > > might be a bit of a pain though...)
> >
> > It would be useful to hear about what the advantages of
> > unmap_mapping_range() are. I don't need a full prototype, I should be
> > able to understand given a rough description of what the required
> > changes are.
>
> The nice part is that basically, if you have a pointer to the file or
> the inode, you can just do something like the following to zap a PTE:
>
> unmap_mapping_range(file->f_mapping, index, 1, 1);
>
> You don't have to take any locks yourself to make that work, you don't
> even have to hold a reference to the mm_struct or anything like that,
> and you don't need any of that logic you currently have in the C
> binder driver to look up the VMA by address and revalidate it is still
> the VMA you expect. The MM subsystem will automatically iterate
> through all VMAs that overlap the specified range of the file's
> address_space, and it will zap PTEs in the specified range (and it
> even works fine if the VMAs have been moved or split or exist in
> multiple processes or stuff like that). It's a similar story on the
> allocation path. The only locks you need to explicitly take are
> whatever locks the driver internally requires.
>
> Going through the fault handler and/or the address_space for
> installing/removing PTEs, instead of using vma_lookup(), is also safer
> because it implicitly ensures you can only operate on your own
> driver's VMAs. From a glance at the Rust binder RFC
> (https://lore.kernel.org/all/20231101-rust-binder-v1-19-08ba9197f637@google.com/),
> it looks like use_page_slow() just looks up the VMA at the expected
> address and calls insert_page() on it. I don't immediately see
> anything in the Rust Binder RFC that would prevent calling
> insert_page() on a newly created VMA of a different type that has
> appeared at the same address - which could cause the page to
> inadvertently become writable by userspace (if the new VMA is
> writable), could cause refcounted pages to be installed in VM_PFNMAP
> regions that are supposed to only contain non-refcounted entries,
> could probably cause type confusion when trying to install 4K pages in
> hugetlb regions that can't contain 4K pages, and so on.

Right ... I guess I'm missing an equivalent to binder_vma_close to
ensure that once the vma is closed, we don't try to insert pages.

I gave a suggestion to enforce that vm_insert_page is only called on
MIXEDMAP vmas in my previous email. I guess that would prevent the
type confusion you mention, but it still seems dangerous ... you risk
that some other driver is storing special data in the private data of
pages in the new vma, and if you insert a random page there, there
could maybe be type confusion on the private data in the `struct
page`?

> Though I just realized, you're only doing this in the shrinker
> callback where you're not supposed to sleep, but unmap_mapping_range()
> requires sleeping locks. So I guess directly using
> unmap_mapping_range() wouldn't work so well. I guess one way to
> address that could be to add a trylock version of
> unmap_mapping_range().
>
> Another more fancy solution to that would be to stop using shrinkers
> entirely, and instead make binder pages show up as clean file pages on
> the kernel's page allocation LRU lists, and let the kernel take care
> of removing page mappings - basically similar to how reclaim works for
> normal file pages. I'm a bit fuzzy on how this area of the kernel
> works exactly; one option might be to do something similar to
> aio_private_file(), creating a new anonymous inode - but unlike in
> AIO, you'd then install that anonymous inode's address_space as the
> i_mapping of the existing file in binder_open(), rather than creating
> a new file. Then you could pin the inode's pages into a page pointer
> array in the kernel (sort of like in aio_setup_ring(), except you'd
> only do it on demand in binder_install_buffer_pages()), and then have
> a "release_folio" operation in your address_space_operations that
> drops your page reference if the page is currently unused. At that
> point, the driver wouldn't really be participating in creating or
> removing userspace PTEs at all, the kernel would mostly manage it for
> you, except that you'd have to tell the kernel when it is okay to
> reclaim pages, or something like that.

Whether it's okay to reclaim a given page can flip-flop very quickly
under some workflows. Changing that setting has to be a pretty fast
operation.

> > >> +    /// Sets the flags associated with the virtual memory area.
> > >> +    ///
> > >> +    /// The possible flags are a combination of the constants in [`flags`].
> > >> +    #[inline]
> > >> +    pub fn set_flags(&self, flags: usize) {
> > >> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> > >> +        // not a data race.
> > >> +        unsafe { (*self.as_ptr()).__bindgen_anon_2.vm_flags = flags as _ };
> > >
> > > It can be a data race, for example you could race with pte_mkwrite()
> > > called from remove_migration_pte(). (But that's an existing issue, the
> > > same data race currently occurs with existing vm_flags updates in
> > > other core kernel paths.)
> > >
> > > Separately from that: If you want to update the flags of existing
> > > VMAs, you need to call vma_start_write() first. The vm_flags_set()
> > > helper is designed to enforce this automatically.
> > >
> > > But really I think drivers normally don't change the flags of existing
> > > VMAs anyway, only the flags of newly created VMAs; only core MM code
> > > should be changing the flags of existing VMAs. You should maybe move
> > > this method over into VmAreaNew.
> >
> > Yeah. Based on what you're saying, moving this method to VmAreaNew
> > sounds like a good idea.
> >
> > Just to confirm, vma_start_write() is not necessary if I move it to
> > VmAreaNew?
>
> Right, the MM subsystem will have already internally called
> vma_start_write() on the VMA by that time the ->mmap handler is
> called.
>
> > > There are ways to set VMA flags that are dangerous and invalid; I
> > > think VM_PFNMAP|VM_MIXEDMAP would be an example (though that might not
> > > actually lead to any actual bad consequences with the way the code is
> > > currently written). Clearing VM_MIXEDMAP after calling
> > > vm_insert_page() might be another one.
> > >
> > > VM_HUGETLB is also a flag you really shouldn't be touching from driver
> > > code, that would cause type confusion through a bad cast in
> > > unmap_single_vma -> __unmap_hugepage_range -> hstate_vma ->
> > > hstate_file -> hstate_inode.
> > >
> > > I'm not sure how safe against programmer error this is supposed to be;
> > > if you want the API to be safe even against silly blunders, I think
> > > you'd have to at least allowlist which bits the driver may clear and
> > > which bits it may set, and additionally deny at least the bad
> > > combination VM_MIXEDMAP|VM_PFNMAP. Though you might want to introduce
> > > higher-level helpers instead, so that a driver author doesn't blunder
> > > by checking VM_WRITE without also clearing VM_MAYWRITE, which is a
> > > somewhat common mistake that IIRC has led to several security bugs in
> > > the past.
> >
> > In general, the goal is to write an abstraction such that you cannot
> > trigger memory safety issues, even if triggering it involves a silly
> > blunder. But only memory safety issues *must* be prevented; other issues
> > such as BUG_ON or deadlocks or resource leaks need not be prevented,
> > though those are still a nice-to-have when possible.
>
> Ah, thanks for the context.
>
> (I think in the Linux kernel, you might theoretically be able to cause
> memory safety issues by deadlocking in particular ways - if you are
> running on a CONFIG_PREEMPT kernel without panic_on_warn set, and
> you're in a non-preemptible context because you're holding a spinlock
> or such, and then you sleep because you try to wait on a mutex or
> kmalloc() with GFP_KERNEL, the scheduler will print a "scheduling
> while atomic" error message and then try to recover from that
> situation by resetting the preempt_count and scheduling anyway. I
> think that theoretically breaks the RCU read-side critical sections
> normally implied by spinlocks once you return to the broken context,
> though IIRC this situation will get fixed up at the next syscall
> return or something along those lines. I haven't tried this though.)

Sleeping in atomic context is a known area of concern. We're working
on it. For now, just assume that it's one of the allowed bad things.
Eventually we would like to handle it properly with this tool:
https://www.memorysafety.org/blog/gary-guo-klint-rust-tools/

> I think the MM subsystem is a place where logic issues and memory
> safety issues are closer to each other than in most other parts of the
> kernel, since much of the logic in the MM subsystem is responsible for
> managing memory...

Agreed.

> > > I think the reasonable ways in which drivers might want to change
> > > flags on a VmAreaNew are:
> > >
> > > 1. clear VM_MAYWRITE after checking VM_WRITE is unset
> > > 2. (you could do the same with the VM_EXEC/VM_MAYEXEC bits)
> > > 3. set VM_PFNMAP (only valid if not VM_MIXEDMAP and no PTEs have been
> > > installed yet)
> > > 4. set VM_MIXEDMAP (only valid if not VM_PFNMAP)
> > > 5. set VM_IO/VM_DONTEXPAND
> > > 6. set VM_DONTCOPY/VM_DONTDUMP (only permanently effective if VM_IO is also set)
> >
> > Okay, thanks for the overview. Comparing with the drivers I'm familiar
> > with, they do the following:
> >
> > Ashmem: Clear VM_MAYREAD/VM_MAYWRITE/VM_MAYEXEC after checking
> > VM_READ/VM_WRITE/VM_EXEC. This is allowed by what you say. (You did not
> > mention VM_READ but I assume the same applies here.)
>
> Yeah, I think VM_READ should work the same way. (With the caveat that
> if you tried to create a mapping that can't be read but can be
> written, or can't be read but can be executed, the restriction on read
> access might be ineffective. In particular, arch-specific code for
> mapping VMA access configuration to page table bits might silently
> ignore such a restriction because the architecture has no way to
> express write-only/execute-only access in the page table bits. I
> imagine other kernel code might also have the assumption baked in that
> at least write access implies read access.)

Got it, makes sense.

> > Binder: Clears VM_MAYWRITE after checking VM_WRITE. Ok. Binder also sets
> > the flags VM_DONTCOPY and VM_MIXEDMAP. Is a VM_PFNMAP check missing? But
> > I guess that flag only gets set if you do so yourself, so the check is
> > probably not necessary?
>
> Yeah, VM_PFNMAP only gets set if you do that yourself (or if you call
> remap_pfn_range(), which will do that for you).

Got it.

> > >> +    /// Maps a single page at the given address within the virtual memory area.
> > >> +    ///
> > >> +    /// This operation does not take ownership of the page.
> > >> +    #[inline]
> > >> +    pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result {
> > >> +        // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is
> > >> +        // not a data race. The page is guaranteed to be valid and of order 0. The range of
> > >> +        // `address` is already checked by `vm_insert_page`.
> > >> +        to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) })
> > >
> > > vm_insert_page() has a kinda weird contract because there are two
> > > contexts from which you can call it cleanly:
> > >
> > > 1. You can call it on a VmAreaRef (just like zap_page_range_single(),
> > > you only need to hold an mmap read lock or VMA read lock, no write
> > > lock is required); however, you must ensure that the VMA is already
> > > marked VM_MIXEDMAP. This is the API contract under which you'd call
> > > this from a fault handler.
> > >
> > > 2. You can call it on a VmAreaNew (and it will take care of setting
> > > VM_MIXEDMAP for you).
> > >
> > > I think nothing would immediately crash if you called vm_insert_page()
> > > on a VMA that does not yet have VM_MIXEDMAP while holding the mmap
> > > lock in write mode; but that would permit weird scenarios where you
> > > could, for example, have a userfaultfd context associated with a VMA
> > > which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd
> > > contexts to VM_MIXEDMAP VMAs. I don't think if that actually leads to
> > > anything bad, but it would be a weird state that probably shouldn't be
> > > permitted.
> > >
> > > There are also safety requirements for the page being installed, but I
> > > guess the checks that vm_insert_page() already does via
> > > validate_page_before_insert() might be enough to make this safe...
> >
> > One way to handle this is to make an VmAreaRef::check_mixedmap that
> > returns a VmAreaMixedMapRef after checking the flag. That type can then
> > have a vm_insert_page method.
>
> Sounds reasonable.
>
> > As for VmAreaNew, I'm not sure we should have it there. If we're not
> > careful, it would be a way to set VM_MIXEDMAP on something that already
> > has the VM_PFNMAP flag. We can probably just tell you to set VM_MIXEDMAP
> > directly and then go through the method on VmAreaRef.
>
> Makes sense.
>
> I guess one tricky part is that it might be bad if you could

Seems like this sentence is incomplete?

Alice