Message-ID: <3460045b-6a33-49b4-84a1-e5603c1a2207@lucifer.local>
Date: Wed, 27 Aug 2025 12:16:44 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Harry Yoo <harry.yoo@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Suren Baghdasaryan <surenb@...gle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
David Hildenbrand <david@...hat.com>, Kees Cook <kees@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>,
Shakeel Butt <shakeel.butt@...ux.dev>, Mike Rapoport <rppt@...nel.org>,
Michal Hocko <mhocko@...e.com>, Jonathan Corbet <corbet@....net>,
Jann Horn <jannh@...gle.com>, Pedro Falcato <pfalcato@...e.de>,
Rik van Riel <riel@...riel.com>, linux-mm@...ck.org,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH V1 2/2] mm: document when rmap locks can be skipped when
setting need_rmap_locks
On Wed, Aug 27, 2025 at 03:52:12PM +0900, Harry Yoo wrote:
> On Tue, Aug 26, 2025 at 10:46:24AM +0100, Lorenzo Stoakes wrote:
> > On Tue, Aug 26, 2025 at 03:58:48PM +0900, Harry Yoo wrote:
> > > While move_ptes() explains when rmap locks can be skipped, when reading
> > > the code setting pmc.need_rmap_locks it is not immediately obvious when
> > > need_rmap_locks can be false. Add a brief explanation in copy_vma() and
> > > relocate_vma_down(), and add a pointer to the comment in move_ptes().
> > >
> > > Meanwhile, fix and improve the comment in move_ptes().
> > >
> > > Signed-off-by: Harry Yoo <harry.yoo@...cle.com>
> >
> > This is great thanks! :)
>
> You're welcome!
>
> > > ---
> > > mm/mremap.c | 4 +++-
> > > mm/vma.c | 7 +++++++
> > > mm/vma_exec.c | 5 +++++
> > > 3 files changed, 15 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index e618a706aff5..86adb872bea0 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -218,8 +218,10 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > > * When need_rmap_locks is false, we use other ways to avoid
> > > * such races:
> > > *
> > > - * - During exec() shift_arg_pages(), we use a specially tagged vma
> > > + * - During exec() relocate_vma_down(), we use a specially tagged vma
> > > * which rmap call sites look for using vma_is_temporary_stack().
> > > + * Folios mapped in the temporary stack vma cannot be migrated until
> > > + * the relocation is complete.
> >
> > Can we actually move this comment over to move_page_tables()? As this is
> > relevant to the whole operation.
>
> Sounds good, will do.
>
> > Also could you put a comment referencing this
> > comment in copy_vma_and_data() as this is where we actually determine whether
> > this is the case or not in _most cases_.
> >
> > Let's just get all the 'need rmap locks' and 'corner cases where it's fine
> > anyway' in one place that is logical :)
>
> Will do.
>
> > Also could you put a comment in copy_vma() over in mm/vma.c saying 'see the
> > comment in mm/mremap.c' or even risk mentioning the function name (risky as code
> > changes but still :P) e.g. 'see comment in move_page_tables()' or something.
>
> Will take a risk and do "See the comment in move_page_tables()" :)
>
> > I'm confused by the 'folios mapped' and 'migrate' bits - and I think people will
> > be confused by that.
> >
> > I think better to say 'page tables for the temporary stack cannot be adjusted
> > until the relocation is complete'.
>
> But is that correct?
>
> Out of all rmap users, only try_to_migrate() cares about
> VM_STACK_INCOMPLETE_SETUP via invalid_migration_vma().
Right, I think you are being too succinct here then :) - as evidenced by my
reading it this way, I think somebody might reasonably be confused by this also.
Let's make this whole comment clearer.
So what's going on here USUALLY for mremap (not the relocate_vma_down() case
which we will get on to) is really that (assuming new VMA case for now):
1. We create a new VMA at destination.
2. The rmap clones its mappings to the new VMA _as well as_ the old.
Therefore, an rmap traversal will encounter BOTH sets of page tables when
traversing from the folio up the rmap.
Note that the mentioned cloning happens in anon_vma_clone() for anon and/or
vma_link() -> vma_link_file() -> (under i_mmap write lock obv.)
__vma_link_file().
The original VMA will only be detached after the operation is complete.
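To make the 'traversal' part concrete, the anon side of an rmap walk is roughly
the below - a simplified sketch from memory rather than the literal mm/rmap.c
code, with locking and the file-backed side hand-waved away:

/*
 * Simplified sketch of rmap_walk_anon() - NOT the literal mm/rmap.c code.
 * The point is that every VMA linked to the folio's anon_vma whose pgoff
 * range covers the folio is visited, so once anon_vma_clone() has run,
 * BOTH the old and the new mremap VMAs (and hence both sets of page
 * tables) are seen, in pgoff order.
 */
static void rmap_walk_anon_sketch(struct folio *folio,
				  struct rmap_walk_control *rwc)
{
	struct anon_vma *anon_vma = folio_anon_vma(folio);
	pgoff_t pgoff_start = folio_pgoff(folio);
	pgoff_t pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
	struct anon_vma_chain *avc;

	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
				       pgoff_start, pgoff_end) {
		struct vm_area_struct *vma = avc->vma;
		/* Essentially what vma_address() computes. */
		unsigned long addr = vma->vm_start +
			((pgoff_start - vma->vm_pgoff) << PAGE_SHIFT);

		/* e.g. try_to_migrate() skips temporary stack VMAs here. */
		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
			continue;

		if (!rwc->rmap_one(folio, vma, addr, rwc->arg))
			break;
	}
}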
If we unmap, say, and the new VMA is before the old VMA, meaning we don't get
saved by traversal order, then all that will happen is that try_to_unmap() will
not be able to unmap, the mapcount will stay >0 and all is fine.
All rmap operations are best effort.
The only cases in which we take rmap locks like this are:
- self-merge
- higher order page table moves (see should_take_rmap_locks(), roughly sketched below)
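(For reference, should_take_rmap_locks() is roughly the below these days - from
memory, so double-check against mm/mremap.c - i.e. the huge entry moves always
take the locks regardless of pmc->need_rmap_locks:)

/*
 * Rough sketch from memory, not a verbatim copy of mm/mremap.c: only the
 * 'normal' PMD/PUD table moves honour pmc->need_rmap_locks, the huge entry
 * (HPAGE_PMD/HPAGE_PUD) moves always take the rmap locks, and
 * move_pgt_entry() takes/drops the locks around the move accordingly.
 */
static bool should_take_rmap_locks(struct pagetable_move_control *pmc,
				   enum pgt_entry entry)
{
	switch (entry) {
	case NORMAL_PMD:
	case NORMAL_PUD:
		return pmc->need_rmap_locks;
	default:
		return true;
	}
}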
So really all this should talk about the self-merge case, rather than broadly
'why we don't care'.
So the comment's already a mess as it is I think.
So we can self-merge if we mremap() to a place immediately adjacent to the
existing VMA.
In that case, we'll simply change the range of the VMA, without doing any
cloning etc, and move page tables 'internally' within the same VMA.
The rmap traversal order guarantees that racing rmap operations are fine when
you self-merge _after_ a VMA, because we're serialised by PTE PTL, and if you
'miss' an entry in the page tables, you'll 'catch' it again when you traverse
later in the range.
HOWEVER, things are a problem if you self-merge _before_ a VMA.
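To make the geometry concrete, a hypothetical userspace sequence that can end up
in this self-merge-before situation is something like the below (whether the
merge actually fires depends on all the usual merge rules of course, so treat it
as illustrative only - and since the region is unfaulted there are no folios for
rmap to find here anyway, this is purely about which copy_vma() path we take):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 0x200000;
	/* Reserve both halves up front so we control the addresses. */
	char *base = mmap(NULL, 2 * len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *old;

	if (base == MAP_FAILED)
		return 1;
	old = base + len;

	/*
	 * Punch a hole below the range we will move, and crucially do NOT
	 * touch [old, old + len) - it must stay unfaulted (no anon_vma) so
	 * copy_vma() is allowed to reset vm_pgoff and the merge can happen.
	 */
	munmap(base, len);

	/*
	 * Move the VMA to the range immediately *below* itself. If
	 * vma_merge_new_range() merges the destination with the (still
	 * present) source VMA, new_vma spans [base, base + 2 * len) and
	 * contains the old vma_start - the 'before' case where traversal
	 * order does not save us and need_rmap_locks ends up true.
	 */
	if (mremap(old, len, len, MREMAP_MAYMOVE | MREMAP_FIXED,
		   base) == MAP_FAILED)
		return 1;

	memset(base, 0xaa, len);	/* fault it in at its new home */
	return 0;
}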
Examining the relevant bit in copy_vma():
An aside about vma pointer changing...:
        new_vma = vma_merge_new_range(&vmg);
        if (new_vma) {
Merged---------^
We have firstly to consider whether we get a merge that will change the
actual VMA pointer - this can only be the case (+ in anon case) if:
1. The old VMA is not faulted in yet (otherwise we'd need to keep the
same vma->vm_pgoff) meaning we can change this to make the merge even
possible.
2. There was a VMA prior to the location in which we move that will be
expanded to fill the range.
This means it has to be _before_ the old VMA for this to happen, e.g.:
if (unlikely(vma_start >= new_vma->vm_start &&
             vma_start < new_vma->vm_end)) {
new_vma will be the _merged_ VMA, with vma_start the _old_ VMA's
start. So for this condition to be true we must have:
vs = vma_start (old VMA ->vm_start)
ve = old VMA ->vm_end
nvs = new_vma->vm_start = addr
nve = new_vma->vm_end
mremap():

         vs        ve
         |---------|
         |         |
         |---------|

addr     addr + len
|--------|
|        |
|--------|

Leads to merge:

nvs               nve
         vs        ve
|--------.---------|
|        .         |
|--------.---------|
                *vmap = vma = new_vma;
Therefore reset VMA--------------^
OK, but if we merge _at all_, we _also_ have to consider the 'before' case, and
pertinent to the rmap lock discussion, this is any case where we get a merge
where _what the rmap indexes on_ would be traversed in a way that could lead to
'seeing' the destination before the source when there is only a single VMA in
play.
Phew.
See how complicated this is? :)
So the pertinent code for this is:
        if (new_vma) {
                ...
                *need_rmap_locks = (new_vma->vm_pgoff <= vma->vm_pgoff);
        } ...
The 'before' comparison here is between the ->vm_pgoff of the new and old VMAs.
Note that we've covered off the case of the VMA changing, so we don't get a UAF
above, and in that case vma == new_vma and thus vm_pgoff will match.
But if that occurred, for the merge to succeed, the vm_pgoff would _have_ to be
prior in the traversal order, so it's correct to then take rmap locks.
Otherwise, in other merge cases, where there is no 3-way merge with a prior VMA
causing the VMA pointer to change, we simply take rmap locks in the case where
the traversal order would not save us.
So yeah given all the above, I do not think the existing comment is very
helpful.
It's _only_ in this case where we need to care.
So I think the conflated thing here is this edge case of the damn horrible
initial stack relocation.
It's the _only place_ where we invoke move_page_tables() outside of mremap.
And of course we don't take any rmap locks, so in comes the explanation about
why that's ok.
We obviously also _don't apply the copy_vma()_ logic there.
We are vma_expand()'ing so it's a single VMA, which is problematic.
Therefore we are handling this with the vma_is_temporary_stack() check.
I think try_to_unmap() is ok to race, as if you race and don't find the page
table entries, you won't get to the folio + thus won't get to update the
mapcount so it's fine.
I think migration is very much NOT fine, because suddenly you have migration
entries that might get left behind as per a8bef8ff6ea1.
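For reference, the hook that saves us is roughly the below (from memory, so
worth double-checking against mm/rmap.c and friends) - note that only
try_to_migrate() wires it up, which is exactly why the unmap race is tolerable
but migration has to be excluded:

/*
 * Rough sketch from memory, not verbatim kernel code. try_to_migrate()
 * installs ->invalid_vma, so its rmap walk skips the temporary stack VMA
 * entirely; try_to_unmap() has no such hook and simply finds nothing if it
 * races with the PTEs being moved.
 */
static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
{
	return vma_is_temporary_stack(vma);
}

void try_to_migrate(struct folio *folio, enum ttu_flags flags)
{
	struct rmap_walk_control rwc = {
		.rmap_one	= try_to_migrate_one,
		.arg		= (void *)flags,
		.done		= folio_not_mapped,
		.anon_lock	= folio_lock_anon_vma_read,
		.invalid_vma	= invalid_migration_vma,
	};

	/* ... flag checks, then rmap_walk{,_locked}(folio, &rwc) ... */
}

/* And the check itself keys off VM_STACK_INCOMPLETE_SETUP: */
static inline bool vma_is_temporary_stack(struct vm_area_struct *vma)
{
	int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);

	if (!maybe_stack)
		return false;

	return (vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
						VM_STACK_INCOMPLETE_SETUP;
}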
OK phew.
I am going to leave it to you to find a way to write the above up :)
Also please reflect, as far as is sensible, in the docs.
A sort of vague beginning might be something like:
* When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
* locks to ensure that rmap will always observe either the old or the
* new ptes. This is the easiest way to avoid races with
* truncate_pagecache(), page migration, etc...
->
When dealing with racing rmap operations, we must be careful, as the
purpose of an rmap walk operation is to traverse from a folio to linked
VMAs, and then typically back down again through the page tables
mapping them.
Since we are manipulating page tables here, we may interfere with an
rmap operation if we are not careful about locking.
The only case where this is meaningful is a self-merge... blah
etc. etc. etc. :)
>
> I'm not sure what prevents from try_to_unmap() from unmapping it while
> it's relocated?
See above.
>
> Looks like it's always been like this since a8bef8ff6ea1 ("mm: migration:
> avoid race between shift_arg_pages() and rmap_walk() during migration by
> not migrating temporary stacks")
>
> > > *
> > > * - During mremap(), new_vma is often known to be placed after vma
> > > * in rmap traversal order. This ensures rmap will always observe
> >
> > This whole bit after could really do with some ASCII diagrams btw :)) ;) but you
> > know maybe out of scope here.
> >
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index 3b12c7579831..3da49f79e9ba 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -1842,6 +1842,11 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > > vmg.next = vma_iter_next_rewind(&vmi, NULL);
> > > new_vma = vma_merge_new_range(&vmg);
> > >
> > > + /*
> > > + * rmap locks can be skipped as long as new_vma is traversed
> > > + * after vma during rmap walk (new_vma->vm_pgoff >= vma->vm_pgoff).
> > > + * See the comment in move_ptes().
> > > + */
> >
> > Obv. would prefer this to say 'move_page_tables()' as mentioned above :P
>
> Will do.
>
> > > if (new_vma) {
> > > /*
> > > * Source vma may have been merged into new_vma
> > > @@ -1879,6 +1884,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > > new_vma->vm_ops->open(new_vma);
> > > if (vma_link(mm, new_vma))
> > > goto out_vma_link;
> > > +
> > > + /* new_vma->pg_off is always >= vma->pg_off if not merged */
> >
> > Err, new_vma is NULL? :) I'm not sure this comment is too useful.
>
> Sometimes the line between "worth commenting" and "too much comment" is
> vague to me :) I'll remove it. Thanks.
It's fine :) instinct to comment more is nice :)
>
> > > *need_rmap_locks = false;
> > > }
> > > return new_vma;
> > > diff --git a/mm/vma_exec.c b/mm/vma_exec.c
> > > index 922ee51747a6..a895dd39ac46 100644
> > > --- a/mm/vma_exec.c
> > > +++ b/mm/vma_exec.c
> > > @@ -63,6 +63,11 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> > > * process cleanup to remove whatever mess we made.
> > > */
> > > pmc.for_stack = true;
> > > + /*
> > > + * pmc.need_rmap_locks is false since rmap locks can be safely skipped
> > > + * because migration is disabled for this vma during relocation.
> > > + * See the comment in move_ptes().
> > > + */
> >
> > Let's reword this also, people will get confused about migration here.
> >
> > 'pmc.need_rmap_locks is false since rmap explicitly checks for
> > vma_is_temporary_stack() and thus extra care does not need to be taken here
> > during stack relocation. See the comment in move_page_tables().'
>
> This looks good! except for one thing, not all rmap users check for
> vma_is_temporary_stack().
Yup see above!
>
> > > if (length != move_page_tables(&pmc))
> > > return -ENOMEM;
> > >
> > > --
> > > 2.43.0
> > >
> >
> > Cheers, Lorenzo
>
> --
> Cheers,
> Harry / Hyeonggon
Cheers, Lorenzo