Message-ID: <CAJuCfpE_jJ0Xq5T0HcLpquRzO+NdvN3T3_JXEwSjt2NG9Ryy5g@mail.gmail.com>
Date: Wed, 23 Apr 2025 14:49:33 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: akpm@...ux-foundation.org, Liam.Howlett@...cle.com, 
	lorenzo.stoakes@...cle.com, david@...hat.com, vbabka@...e.cz, 
	peterx@...hat.com, jannh@...gle.com, hannes@...xchg.org, mhocko@...nel.org, 
	paulmck@...nel.org, shuah@...nel.org, adobriyan@...il.com, brauner@...nel.org, 
	josef@...icpanda.com, yebin10@...wei.com, linux@...ssschuh.net, 
	willy@...radead.org, osalvador@...e.de, andrii@...nel.org, 
	ryan.roberts@....com, christophe.leroy@...roup.eu, 
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	linux-mm@...ck.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
<andrii.nakryiko@...il.com> wrote:
>
> On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> >
> > With maple_tree supporting vma tree traversal under RCU and vma and
> > its important members being RCU-safe, /proc/pid/maps can be read under
> > RCU and without the need to read-lock mmap_lock. However vma content
> > can change from under us, therefore we make a copy of the vma and we
> > pin pointer fields used when generating the output (currently only
> > vm_file and anon_name). Afterwards we check for concurrent address
> > space modifications, wait for them to end and retry. While we take
> > the mmap_lock for reading during such contention, we do that only
> > momentarily to record the new mm_wr_seq counter. This change is designed to reduce
>
> This is probably a stupid question, but why do we need to take a lock
> just to record this counter? uprobes get away without taking mmap_lock
> even for reads, and still record this seq counter. And then detect
> whether there were any modifications in between. Why does this change
> need more heavy-weight mmap_read_lock to do speculative reads?

Not a stupid question. mmap_read_lock() is used to wait for the writer
to finish what it's doing; then we continue by recording a new
sequence counter value and calling mmap_read_unlock(). This is what
get_vma_snapshot() does. But your question made me realize that we can
optimize m_start() further by not taking mmap_read_lock at all.
Instead of taking mmap_read_lock and then doing drop_mmap_lock(), we
can try mmap_lock_speculate_try_begin() and only if it fails do the
same dance we do in get_vma_snapshot(). I think that should work.

>
> > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > (often a low priority task, such as monitoring/data collection services)
> > from blocking address space updates.
> > Note that this change has a userspace visible disadvantage: it allows
> > for sub-page data tearing as opposed to the previous mechanism where
> > data tearing could happen only between pages of generated output data.
> > Since current userspace considers data tearing between pages to be
> > acceptable, we assume it will be able to handle sub-page data tearing
> > as well.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@...gle.com>
> > ---
> >  fs/proc/internal.h        |   6 ++
> >  fs/proc/task_mmu.c        | 170 ++++++++++++++++++++++++++++++++++----
> >  include/linux/mm_inline.h |  18 ++++
> >  3 files changed, 177 insertions(+), 17 deletions(-)
> >
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index 96122e91c645..6e1169c1f4df 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -379,6 +379,12 @@ struct proc_maps_private {
> >         struct task_struct *task;
> >         struct mm_struct *mm;
> >         struct vma_iterator iter;
> > +       bool mmap_locked;
> > +       loff_t last_pos;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +       unsigned int mm_wr_seq;
> > +       struct vm_area_struct vma_copy;
> > +#endif
> >  #ifdef CONFIG_NUMA
> >         struct mempolicy *task_mempolicy;
> >  #endif
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index b9e4fbbdf6e6..f9d50a61167c 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> >  }
> >  #endif
> >
> > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > -                                               loff_t *ppos)
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static const struct seq_operations proc_pid_maps_op;
> > +
> > +/*
> > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > + * show_map_vma.
> > + */
> > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > +{
> > +       struct vm_area_struct *copy = &priv->vma_copy;
> > +       int ret = -EAGAIN;
> > +
> > +       memcpy(copy, vma, sizeof(*vma));
> > +       if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > +               goto out;
> > +
> > +       if (!anon_vma_name_get_if_valid(copy))
> > +               goto put_file;
>
> Given vm_file and anon_vma_name are both RCU-protected, if we take
> rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't
> even need getting/putting them, no?

Yeah, anon_vma_name indeed looks safe without pinning, but vm_file is
allocated from a SLAB_TYPESAFE_BY_RCU cache, so the pointer might
still reference valid memory while pointing at the wrong object, even
before an RCU grace period has passed. With my assumption that
seq_file can't roll back once show_map() is done, I need a completely
stable vma at the time show_map() is executed, so that it does not
change from under us while we are generating the output.
OTOH, if we can indeed roll back while generating seq_file output,
then show_map() could output a potentially invalid vma, then check for
vma changes and, when they are detected, roll back the seq_file and
retry.
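
The SLAB_TYPESAFE_BY_RCU hazard is roughly the following (a toy
userspace model, not the real get_file_rcu() implementation): the
slot's memory stays a valid struct, so after elevating the refcount
you must re-check that the vma still points at the same object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy model of a SLAB_TYPESAFE_BY_RCU object: the slot stays a valid
 * "file", but can be recycled for a different file without waiting
 * for a grace period. */
struct file_model {
	atomic_int refcount;   /* 0 means free and reusable */
	int inode_id;          /* identity of the file in this slot */
};

struct vma_model {
	struct file_model *_Atomic vm_file;
};

/* Model of the get_file_rcu()-style dance: elevate the refcount, then
 * re-check that the vma still points at the same object. If the slot
 * was recycled, a plain refcount bump would pin the *wrong* file. */
static struct file_model *vma_file_get(struct vma_model *vma)
{
	struct file_model *f = atomic_load(&vma->vm_file);

	if (!f)
		return NULL;
	if (atomic_fetch_add(&f->refcount, 1) == 0) {
		/* Slot was free: we resurrected a dead object, back off
		 * (models the inc-not-zero failure case). */
		atomic_fetch_sub(&f->refcount, 1);
		return NULL;
	}
	if (atomic_load(&vma->vm_file) != f) {
		/* vma moved on; the pointer was stale. Drop and bail. */
		atomic_fetch_sub(&f->refcount, 1);
		return NULL;
	}
	return f;
}
```

This is why "the pointer is still dereferenceable" is not the same as
"the pointer still means what it did when we copied the vma".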

>
> I feel like I'm missing some important limitations, but I don't think
> they are spelled out explicitly...
>
> > +
> > +       if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> > +               return 0;
> > +
> > +       /* Address space got modified, vma might be stale. Re-lock and retry. */
> > +       rcu_read_unlock();
>
> Another question I have is whether we really need to copy vma into
> priv->vma_copy to have a stable snapshot? Can't we just speculate like
> with uprobes under assumption that data doesn't change. And once we
> are done producing text output, confirm that speculation was
> successful, and if not - retry?
>
> We'd need an API for seq_file to rollback to previous good known
> location for that, but that seems straightforward enough to do, no?
> Just remember buffer position before speculation, write data, check
> for no mm modifications, and if something changed, rollback seq file
> to last position.

From looking at seq_read_iter(), I think for a rollback we would have
to reset seq_file.count and seq_file.index to their previous states.
At this location:
https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L272 if
show_map() returns a negative value, m->count will indeed be rolled
back but seq_file.index will not. Also, returning a negative value at
https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L230
would be interpreted as a hard error... So, I'll need to spend some
time in this code to get the answer about rollback.
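
Conceptually, the rollback you are proposing would look something like
this (a toy model only: seq_checkpoint()/seq_rollback() are
hypothetical helpers, not existing seq_file API, and the real seq_file
has more state than these two fields):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of a seq_file buffer where both the byte count and the
 * record cursor must be saved and restored together. */
struct seq_model {
	char buf[256];
	size_t count;   /* bytes of output produced, like seq_file.count */
	size_t index;   /* record cursor, like seq_file.index */
};

struct seq_checkpoint {
	size_t count;
	size_t index;
};

/* Remember the last known-good position before speculating. */
static struct seq_checkpoint seq_checkpoint(const struct seq_model *m)
{
	return (struct seq_checkpoint){ .count = m->count, .index = m->index };
}

/* Produce one record of output, advancing both fields. */
static void seq_puts_model(struct seq_model *m, const char *s)
{
	size_t len = strlen(s);

	memcpy(m->buf + m->count, s, len);
	m->count += len;
	m->index++;
}

/* Roll both fields back together after a failed speculation; as noted
 * above, rolling back count alone leaves index pointing past the
 * discarded record. */
static void seq_rollback(struct seq_model *m, struct seq_checkpoint cp)
{
	m->count = cp.count;
	m->index = cp.index;
}
```

The subtlety you hit in seq_read_iter() is exactly that the existing
error paths restore only one of the two fields.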
Thanks for the review!

>
>
> > +       ret = mmap_read_lock_killable(priv->mm);
> > +       if (!ret) {
> > +               /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
> > +               mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > +               mmap_read_unlock(priv->mm);
> > +               ret = -EAGAIN;
> > +       }
> > +
> > +       rcu_read_lock();
> > +
> > +       anon_vma_name_put_if_valid(copy);
> > +put_file:
> > +       if (copy->vm_file)
> > +               fput(copy->vm_file);
> > +out:
> > +       return ret;
> > +}
> > +
> > +static void put_vma_snapshot(struct proc_maps_private *priv)
> > +{
> > +       struct vm_area_struct *vma = &priv->vma_copy;
> > +
> > +       anon_vma_name_put_if_valid(vma);
> > +       if (vma->vm_file)
> > +               fput(vma->vm_file);
> > +}
> > +
>
> [...]
