Message-ID: <CAEf4BzbphUBPnA7iDz5pis17GRwzpqsduftV_JHyf1Ce0MMqzw@mail.gmail.com>
Date: Tue, 28 May 2024 13:36:42 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>, Andrii Nakryiko <andrii@...nel.org>,
linux-fsdevel@...r.kernel.org, brauner@...nel.org, viro@...iv.linux.org.uk,
akpm@...ux-foundation.org, linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
gregkh@...uxfoundation.org, linux-mm@...ck.org, surenb@...gle.com,
rppt@...nel.org
Subject: Re: [PATCH v2 4/9] fs/procfs: use per-VMA RCU-protected locking in
PROCMAP_QUERY API
On Fri, May 24, 2024 at 12:48 PM Liam R. Howlett
<Liam.Howlett@...cle.com> wrote:
>
> * Andrii Nakryiko <andrii@...nel.org> [240524 00:10]:
> > Attempt to use RCU-protected per-VMA lock when looking up the requested VMA
> > as much as possible, only falling back to mmap_lock if per-VMA lock
> > failed. This is done so that querying of VMAs doesn't interfere with
> > other critical tasks, like page fault handling.
> >
> > This has been suggested by mm folks, and we make use of a newly added
> > internal API that works like find_vma(), but tries to use per-VMA lock.
>
> Thanks for doing this.
>
> >
> > Signed-off-by: Andrii Nakryiko <andrii@...nel.org>
> > ---
> > fs/proc/task_mmu.c | 42 ++++++++++++++++++++++++++++++++++--------
> > 1 file changed, 34 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 8ad547efd38d..2b14d06d1def 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -389,12 +389,30 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > )
> >
> > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > - unsigned long addr, u32 flags)
> > + unsigned long addr, u32 flags,
> > + bool *mm_locked)
> > {
> > struct vm_area_struct *vma;
> > + bool mmap_locked;
> > +
> > + *mm_locked = mmap_locked = false;
> >
> > next_vma:
> > - vma = find_vma(mm, addr);
> > + if (!mmap_locked) {
> > + /* if we haven't yet acquired mmap_lock, try to use less disruptive per-VMA */
> > + vma = find_and_lock_vma_rcu(mm, addr);
> > + if (IS_ERR(vma)) {
>
> There is a chance that find_and_lock_vma_rcu() will return NULL in
> cases where NULL should never be returned.
>
> If you follow the MAP_FIXED call to mmap(), you'll land in mmap_region()
> which does two operations: munmap(), then the mmap(). Since this was
> behind a lock, it was fine. Now that we're transitioning to rcu
> readers, it's less ideal. We have a race where we will see that gap.
> In this implementation we may return NULL if the MAP_FIXED is at the end
> of the address space.
>
> It might also cause issues if we are searching for a specific address,
> since we may skip a VMA that is currently being inserted by MAP_FIXED.
>
> The page fault handler doesn't have this issue as it looks for a
> specific address then falls back to the lock if one is not found.
>
> This problem needs to be fixed prior to shifting the existing proc maps
> file to using rcu read locks as well. We have a solution that isn't
> upstream or on the ML, but is being tested and will go upstream.

Ok, any ETA for that? Can it be retrofitted into
find_and_lock_vma_rcu() once the fix lands? It's not ideal, but I
think it's acceptable (for now) for this new API to have this race,
given it seems quite unlikely to be hit in practice.
Worst case, we can leave the per-VMA RCU-protected bits out until we
have this solution in place, and then add it back when ready.
>
> > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > + if (mmap_read_lock_killable(mm))
> > + return ERR_PTR(-EINTR);
> > +
> > + *mm_locked = mmap_locked = true;
> > + vma = find_vma(mm, addr);
>
> If you lock the vma here then drop the mmap lock, then you should be
> able to simplify the code by avoiding the passing of the mmap_locked
> variable around.
>
> It also means we don't need an unlock_vma() call, whose name suggests
> we are ending the vma read but which may actually be unlocking the mm.
>
> This is exactly why I think we need a common pattern and infrastructure
> to do this sort of walking.
>
> Please have a look at userfaultfd patches here [1]. Note that
> vma_start_read() cannot be used in the mmap_read_lock() critical
> section.

Ok, so you'd like me to do something like below, right?

  vma = find_vma(mm, addr);
  if (vma)
          down_read(&vma->vm_lock->lock);
  mmap_read_unlock(mm);

... and for the rest of the logic always assume we are holding the
per-VMA lock. The problem here is that I don't think we can assume the
per-VMA lock is available, because it's gated by CONFIG_PER_VMA_LOCK,
so I think we'll have to deal with this mmap_locked flag either way.
Or am I missing anything?

I don't think the flag makes things that much worse, tbh, but I'm
happy to accommodate any better solution that works regardless of
CONFIG_PER_VMA_LOCK.
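
To spell out what I understand the suggestion to be, here's roughly
how that pattern could be wrapped to also cover the
!CONFIG_PER_VMA_LOCK case (completely untested sketch, the helper name
is made up):

  /* hypothetical helper: find VMA at addr, pin it, drop mmap_lock if we can */
  static struct vm_area_struct *query_find_and_pin_vma(struct mm_struct *mm,
                                                       unsigned long addr,
                                                       bool *mm_locked)
  {
          struct vm_area_struct *vma;

          if (mmap_read_lock_killable(mm))
                  return ERR_PTR(-EINTR);

          vma = find_vma(mm, addr);
          if (!vma) {
                  mmap_read_unlock(mm);
                  *mm_locked = false;
                  return NULL;
          }

  #ifdef CONFIG_PER_VMA_LOCK
          /* pin the VMA itself, then let go of mmap_lock entirely */
          down_read(&vma->vm_lock->lock);
          mmap_read_unlock(mm);
          *mm_locked = false;
  #else
          /* no per-VMA locks, so mmap_lock stays held for the caller */
          *mm_locked = true;
  #endif
          return vma;
  }

The #else branch is exactly where the mm_locked flag sneaks back in:
the caller still has to know which lock it is holding.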
>
> > + }
> > + } else {
> > + /* if we have mmap_lock, get through the search as fast as possible */
> > + vma = find_vma(mm, addr);
>
> I think the only way we get here is if we are contending on the mmap
> lock. This is actually where we should try to avoid holding the lock?
>
> > + }
> >
> > /* no VMA found */
> > if (!vma)
> > @@ -428,18 +446,25 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > skip_vma:
> > /*
> > * If the user needs closest matching VMA, keep iterating.
> > + * But before we proceed we might need to unlock current VMA.
> > */
> > addr = vma->vm_end;
> > + if (!mmap_locked)
> > + vma_end_read(vma);
> > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
> > goto next_vma;
> > no_vma:
> > - mmap_read_unlock(mm);
> > + if (mmap_locked)
> > + mmap_read_unlock(mm);
> > return ERR_PTR(-ENOENT);
> > }
> >
> > -static void unlock_vma(struct vm_area_struct *vma)
> > +static void unlock_vma(struct vm_area_struct *vma, bool mm_locked)
>
> Confusing function name, since it may not be doing anything with the
> vma lock.
Would "unlock_vma_or_mm()" be ok?
>
> > {
> > - mmap_read_unlock(vma->vm_mm);
> > + if (mm_locked)
> > + mmap_read_unlock(vma->vm_mm);
> > + else
> > + vma_end_read(vma);
> > }
> >
> > static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > @@ -447,6 +472,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > struct procmap_query karg;
> > struct vm_area_struct *vma;
> > struct mm_struct *mm;
> > + bool mm_locked;
> > const char *name = NULL;
> > char *name_buf = NULL;
> > __u64 usize;
> > @@ -475,7 +501,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > if (!mm || !mmget_not_zero(mm))
> > return -ESRCH;
> >
> > - vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
> > + vma = query_matching_vma(mm, karg.query_addr, karg.query_flags, &mm_locked);
> > if (IS_ERR(vma)) {
> > mmput(mm);
> > return PTR_ERR(vma);
> > @@ -542,7 +568,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > }
> >
> > /* unlock vma/mm_struct and put mm_struct before copying data to user */
> > - unlock_vma(vma);
> > + unlock_vma(vma, mm_locked);
> > mmput(mm);
> >
> > if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
> > @@ -558,7 +584,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > return 0;
> >
> > out:
> > - unlock_vma(vma);
> > + unlock_vma(vma, mm_locked);
> > mmput(mm);
> > kfree(name_buf);
> > return err;
> > --
> > 2.43.0
> >
>
> [1]. https://lore.kernel.org/linux-mm/20240215182756.3448972-5-lokeshgidra@google.com/
>
> Thanks,
> Liam