[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAG48ez3YLWh9hXQQdGVQ7hCsd=k_i2Z2NO6qzT6NaOYiRjy=nw@mail.gmail.com>
Date: Tue, 29 Apr 2025 17:39:25 +0200
From: Jann Horn <jannh@...gle.com>
To: Suren Baghdasaryan <surenb@...gle.com>, Al Viro <viro@...iv.linux.org.uk>, brauner@...nel.org,
linux-fsdevel@...r.kernel.org
Cc: akpm@...ux-foundation.org, Liam.Howlett@...cle.com,
lorenzo.stoakes@...cle.com, david@...hat.com, vbabka@...e.cz,
peterx@...hat.com, hannes@...xchg.org, mhocko@...nel.org, paulmck@...nel.org,
shuah@...nel.org, adobriyan@...il.com, josef@...icpanda.com,
yebin10@...wei.com, linux@...ssschuh.net, willy@...radead.org,
osalvador@...e.de, andrii@...nel.org, ryan.roberts@....com,
christophe.leroy@...roup.eu, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <surenb@...gle.com> wrote:
> With maple_tree supporting vma tree traversal under RCU and vma and
> its important members being RCU-safe, /proc/pid/maps can be read under
> RCU and without the need to read-lock mmap_lock. However vma content
> can change from under us, therefore we make a copy of the vma and we
> pin pointer fields used when generating the output (currently only
> vm_file and anon_name). Afterwards we check for concurrent address
> space modifications, wait for them to end and retry. While we take
> the mmap_lock for reading during such contention, we do that momentarily
> only to record new mm_wr_seq counter. This change is designed to reduce
> mmap_lock contention and prevent a process reading /proc/pid/maps files
> (often a low priority task, such as monitoring/data collection services)
> from blocking address space updates.
[...]
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f9d50a61167c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
[...]
> +/*
> + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> + * show_map_vma.
> + */
> +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> +{
> + struct vm_area_struct *copy = &priv->vma_copy;
> + int ret = -EAGAIN;
> +
> + memcpy(copy, vma, sizeof(*vma));
> + if (copy->vm_file && !get_file_rcu(©->vm_file))
> + goto out;
I think this uses get_file_rcu() in a different way than intended.
As I understand it, get_file_rcu() is supposed to be called on a
pointer which always points to a file with a non-zero refcount (except
when it is NULL). That's why it takes a file** instead of a file* - if
it observes a zero refcount, it assumes that the pointer must have
been updated in the meantime, and retries. Calling get_file_rcu() on a
pointer that points to a file with zero refcount, which I think can
happen with this patch, will cause an endless loop.
(Just as background: For other usecases, get_file_rcu() is supposed to
still behave nicely and not spuriously return NULL when the file* is
concurrently updated to point to another file*; that's what that loop
is for.)
(If my understanding is correct, maybe we should document that more
explicitly...)
Also, I think you are introducing an implicit assumption that
remove_vma() does not NULL out the ->vm_file pointer (because that
could cause tearing and could theoretically lead to a torn pointer
being accessed here).
One alternative might be to change the paths that drop references to
vma->vm_file (search for vma_close to find them) such that they first
NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
maybe with a new helper like this:
static void vma_fput(struct vm_area_struct *vma)
{
struct file *file = vma->vm_file;
if (file) {
WRITE_ONCE(vma->vm_file, NULL);
fput(file);
}
}
Then on the lockless lookup path you could use get_file_rcu() on the
->vm_file pointer _of the original VMA_, and store the returned file*
into copy->vm_file.
> + if (!anon_vma_name_get_if_valid(copy))
> + goto put_file;
> +
> + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> + return 0;
We only check for concurrent updates at this point, so up to here,
anything we read from "copy" could contain torn pointers (both because
memcpy() is not guaranteed to copy pointers atomically and because the
updates to the original VMA are not done with WRITE_ONCE()).
That probably means that something like the preceding
anon_vma_name_get_if_valid() could crash on an access to a torn
pointer.
Please either do another mmap_lock_speculate_retry() check directly
after the memcpy(), or ensure nothing before this point reads from
"copy".
> + /* Address space got modified, vma might be stale. Re-lock and retry. */
> + rcu_read_unlock();
> + ret = mmap_read_lock_killable(priv->mm);
> + if (!ret) {
> + /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
> + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> + mmap_read_unlock(priv->mm);
> + ret = -EAGAIN;
> + }
> +
> + rcu_read_lock();
> +
> + anon_vma_name_put_if_valid(copy);
> +put_file:
> + if (copy->vm_file)
> + fput(copy->vm_file);
> +out:
> + return ret;
> +}
[...]
> @@ -266,39 +399,41 @@ static void get_vma_name(struct vm_area_struct *vma,
> } else {
> *path = file_user_path(vma->vm_file);
> }
> - return;
> + goto out;
> }
>
> if (vma->vm_ops && vma->vm_ops->name) {
> *name = vma->vm_ops->name(vma);
This seems to me like a big, subtle change of semantics. After this
change, vm_ops->name() will no longer receive a real VMA; and in
particular, I think the .name implementation special_mapping_name used
in special_mapping_vmops will have a UAF because it relies on
vma->vm_private_data pointing to a live object.
I think you'll need to fall back to using the mmap lock and the real
VMA if you see a non-NULL vma->vm_ops->name pointer.
> if (*name)
> - return;
> + goto out;
> }
>
> *name = arch_vma_name(vma);
> if (*name)
> - return;
> + goto out;
>
> if (!vma->vm_mm) {
> *name = "[vdso]";
> - return;
> + goto out;
> }
>
> if (vma_is_initial_heap(vma)) {
> *name = "[heap]";
> - return;
> + goto out;
> }
>
> if (vma_is_initial_stack(vma)) {
> *name = "[stack]";
> - return;
> + goto out;
> }
>
> if (anon_name) {
> *name_fmt = "[anon:%s]";
> *name = anon_name->name;
> - return;
> }
> +out:
> + if (anon_name && !mmap_locked)
> + anon_vma_name_put(anon_name);
Isn't this refcount drop too early, causing UAF read? We drop the
reference on the anon_name here, but (on some paths) we're about to
return anon_name->name to the caller through *name, and the caller
will read from it.
Ah, but I guess it's actually fine because the refcount increment was
unnecessary in the first place, because the vma pointer actually
points to a copy of the original VMA, and the copy has its own
refcounted reference to the anon_name thanks to get_vma_snapshot()?
It might be helpful to have some comments documenting which VMA
pointers can point to copies.
Powered by blists - more mailing lists