linux-kernel - Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAG48ez15g5n9AoMJk1yPHsDCq2PGxCHc2WhCAzH8B2o6PgDwzQ@mail.gmail.com>
Date: Tue, 29 Apr 2025 19:20:26 +0200
From: Jann Horn <jannh@...gle.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: Al Viro <viro@...iv.linux.org.uk>, brauner@...nel.org, 
	linux-fsdevel@...r.kernel.org, akpm@...ux-foundation.org, 
	Liam.Howlett@...cle.com, lorenzo.stoakes@...cle.com, david@...hat.com, 
	vbabka@...e.cz, peterx@...hat.com, hannes@...xchg.org, mhocko@...nel.org, 
	paulmck@...nel.org, shuah@...nel.org, adobriyan@...il.com, 
	josef@...icpanda.com, yebin10@...wei.com, linux@...ssschuh.net, 
	willy@...radead.org, osalvador@...e.de, andrii@...nel.org, 
	ryan.roberts@....com, christophe.leroy@...roup.eu, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

Hi!

(I just noticed that I incorrectly assumed that VMAs use kfree_rcu
(not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
forgot all about that...)

On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan <surenb@...gle.com> wrote:
> On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <jannh@...gle.com> wrote:
> > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > RCU and without the need to read-lock mmap_lock. However vma content
> > > can change from under us, therefore we make a copy of the vma and we
> > > pin pointer fields used when generating the output (currently only
> > > vm_file and anon_name). Afterwards we check for concurrent address
> > > space modifications, wait for them to end and retry. While we take
> > > the mmap_lock for reading during such contention, we do that momentarily
> > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > (often a low priority task, such as monitoring/data collection services)
> > > from blocking address space updates.
> > [...]
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > [...]
> > > +/*
> > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > + * show_map_vma.
> > > + */
> > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > +{
> > > +       struct vm_area_struct *copy = &priv->vma_copy;
> > > +       int ret = -EAGAIN;
> > > +
> > > +       memcpy(copy, vma, sizeof(*vma));
> > > +       if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > +               goto out;
> >
> > I think this uses get_file_rcu() in a different way than intended.
> >
> > As I understand it, get_file_rcu() is supposed to be called on a
> > pointer which always points to a file with a non-zero refcount (except
> > when it is NULL). That's why it takes a file** instead of a file* - if
> > it observes a zero refcount, it assumes that the pointer must have
> > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > pointer that points to a file with zero refcount, which I think can
> > happen with this patch, will cause an endless loop.
> > (Just as background: For other usecases, get_file_rcu() is supposed to
> > still behave nicely and not spuriously return NULL when the file* is
> > concurrently updated to point to another file*; that's what that loop
> > is for.)
>
> Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> checking the return value of get_file_rcu() and retrying speculation
> if it changed.

I think you could probably still end up looping endlessly in get_file_rcu().

> > (If my understanding is correct, maybe we should document that more
> > explicitly...)
>
> Good point. I'll add comments for get_file_rcu() as a separate patch.
>
> >
> > Also, I think you are introducing an implicit assumption that
> > remove_vma() does not NULL out the ->vm_file pointer (because that
> > could cause tearing and could theoretically lead to a torn pointer
> > being accessed here).
> >
> > One alternative might be to change the paths that drop references to
> > vma->vm_file (search for vma_close to find them) such that they first
> > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> > maybe with a new helper like this:
> >
> > static void vma_fput(struct vm_area_struct *vma)
> > {
> >   struct file *file = vma->vm_file;
> >
> >   if (file) {
> >     WRITE_ONCE(vma->vm_file, NULL);
> >     fput(file);
> >   }
> > }
> >
> > Then on the lockless lookup path you could use get_file_rcu() on the
> > ->vm_file pointer _of the original VMA_, and store the returned file*
> > into copy->vm_file.
>
> Ack. Except for storing the return value of get_file_rcu(). I think
> once we detect that  get_file_rcu() returns a different file we should
> bail out and retry. The change in file is an indication that the vma
> got changed from under us, so whatever we have is stale.

What does "different file" mean here - what file* would you compare
the returned one against?