linux-kernel - Re: [DISCUSSION] Unconditionally lock folios when calling rmap

Message-ID: <d3a9dbb6-b1f3-43f0-89ae-85c0aef16bdf@lucifer.local>
Date: Thu, 28 Aug 2025 12:31:34 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Lokesh Gidra <lokeshgidra@...gle.com>
Cc: David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Harry Yoo <harry.yoo@...cle.com>, Zi Yan <ziy@...dia.com>,
        Barry Song <21cnbao@...il.com>,
        "open list:MEMORY MANAGEMENT" <linux-mm@...ck.org>,
        Peter Xu <peterx@...hat.com>, Suren Baghdasaryan <surenb@...gle.com>,
        Kalesh Singh <kaleshsingh@...gle.com>,
        android-mm <android-mm@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Jann Horn <jannh@...gle.com>, Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>
Subject: Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()

On Tue, Aug 26, 2025 at 03:23:28PM -0700, Lokesh Gidra wrote:
> On Tue, Aug 26, 2025 at 8:52 AM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> >
> > On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote:
> > > Hi all,
> > >
> > > Currently, some callers of rmap_walk() conditionally avoid try-locking
> > > non-ksm anon folios. This necessitates serialization through anon_vma
> > > write-lock elsewhere when folio->mapping and/or folio->index (fields
> > > involved in rmap_walk()) are to be updated. This hurts scalability due
> > > to coarse granularity of the lock. For instance, when multiple threads
> > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages
> > > from the same src VMA, they all contend for the corresponding
> > > anon_vma’s lock. Field traces for arm64 android devices reveal over

Hmm, I started by responding below but now have a vague thought of:

What if we find a way to somehow detect this scenario, and mark the
anon_vma in some way to indicate that a folio lock should be tried?

That'd be a lot less egregious than changing things fundamentally for
everyone.
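
A minimal userspace sketch of that idea, purely for illustration. Every name
here (anon_vma_model, needs_folio_lock, walker_should_trylock, and so on) is
hypothetical and does not exist in the kernel, and the sketch deliberately
leaves open how setting the marker would be ordered against walkers that are
already in flight:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; none of these types or fields exist in the kernel. */
struct anon_vma_model {
	atomic_bool needs_folio_lock;	/* hypothetical opt-in marker */
};

struct folio_model {
	struct anon_vma_model *anon_vma;	/* NULL for non-anon folios */
	bool ksm;
};

/* Hypothetical opt-in, e.g. set the first time uffd MOVE touches the VMA. */
void anon_vma_mark_needs_folio_lock(struct anon_vma_model *av)
{
	assert(av != NULL);
	atomic_store(&av->needs_folio_lock, true);
}

/* Should a walker such as folio_referenced() trylock this folio? */
bool walker_should_trylock(const struct folio_model *folio, bool is_locked)
{
	if (is_locked)
		return false;
	/* File and KSM folios are trylocked today; keep that unchanged. */
	if (folio->anon_vma == NULL || folio->ksm)
		return true;
	/* Non-KSM anon: only pay for the trylock if the anon_vma opted in. */
	return atomic_load(&folio->anon_vma->needs_folio_lock);
}
```

With a scheme of this shape, processes that never opt in would keep today's
behaviour for non-KSM anon folios.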

> > > 30ms of uninterruptible sleep in the main UI thread, leading to janky
> > > user interactions.
> >
> > Can we clarify whether this is simply an example, or rather the entire
> > motivating reason for raising this issue?
> >
> When I started off I thought maybe there are other cases too, but it
> looks like as of now only uffd MOVE updates folio->mapping to a
> different root anon_vma.

Yup, I mean I looked into making mremap() do it, but it was insanely
difficult to make it work (sadly!). But indeed.

I think it's important to highlight that this is the use case.

I wonder if we can't do something specific to uffd then that would be
potentially less problematic for the rest of core.

Because I just don't really see this as upstreamable otherwise.

>
> > It's important, because it strikes me that this is a very specific use
> > case, and you're now suggesting changing core locking to suit it.
> >
> > While this is a discussion, and I'm glad you raised it, I think it's
> > important in these cases to really exhaustively examine all of the possible
> > consequences.
> >
> > OK so to clarify:
> >
> > - You want to traverse the rmap entirely without any rmap locks whatsoever
> >   for anon, relying solely on the folio lock to serialise, because
> >   otherwise rmap read locks here block other rmap write lock calls.
> >
> There is a misunderstanding. I'm suggesting locking *both* folio as
> well as anon_vma during rmap walk. To avoid any confusion, here are
> the simplifications in mm/rmap.c that I suggest:

OK. Well that's less extreme :)

But then, if we're taking both locks, how does this prevent contention on
the anon_vma lock?

Even so, this is adding a bunch of overhead.

>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 568198e9efc2..81c177b0cddf 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -547,7 +547,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>         struct anon_vma *root_anon_vma;
>         unsigned long anon_mapping;
>
> -retry:
>         rcu_read_lock();
>         anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
>         if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> @@ -558,17 +557,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>         anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
>         root_anon_vma = READ_ONCE(anon_vma->root);
>         if (down_read_trylock(&root_anon_vma->rwsem)) {
> -               /*
> -                * folio_move_anon_rmap() might have changed the anon_vma as we
> -                * might not hold the folio lock here.
> -                */
> -               if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> -                            anon_mapping)) {
> -                       up_read(&root_anon_vma->rwsem);
> -                       rcu_read_unlock();
> -                       goto retry;
> -               }
> -
>                 /*
>                  * If the folio is still mapped, then this anon_vma is still
>                  * its anon_vma, and holding the mutex ensures that it will
> @@ -603,18 +591,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
>         rcu_read_unlock();
>         anon_vma_lock_read(anon_vma);
>
> -       /*
> -        * folio_move_anon_rmap() might have changed the anon_vma as we might
> -        * not hold the folio lock here.
> -        */
> -       if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> -                    anon_mapping)) {
> -               anon_vma_unlock_read(anon_vma);
> -               put_anon_vma(anon_vma);
> -               anon_vma = NULL;
> -               goto retry;
> -       }
> -
>         if (atomic_dec_and_test(&anon_vma->refcount)) {
>                 /*
>                  * Oops, we held the last refcount, release the lock
> @@ -1006,7 +982,7 @@ int folio_referenced(struct folio *folio, int is_locked,
>         if (!folio_raw_mapping(folio))
>                 return 0;
>
> -       if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> +       if (!is_locked) {
>                 we_locked = folio_trylock(folio);
>                 if (!we_locked)
>                         return 1;

This is still a really big change, we're going to be contending the folio
lock potentially a LOT more, for the sake of a very specific and peculiar
uffd use case.

It's hard to justify. And any such justification would need _really
serious_ testing on very many arches/workloads to even come close to being
ok in my view.

This is a pretty huge ask + it's for a specific use case.
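
To make the size of the change concrete, here is a small sketch that compares
the two trylock conditions from the folio_referenced() hunk quoted above. The
function names are made up for illustration; only the two boolean conditions
are taken from the diff:

```c
#include <assert.h>
#include <stdbool.h>

/* Condition in folio_referenced() today, as in the removed line of the hunk. */
bool trylock_today(bool is_locked, bool anon, bool ksm)
{
	return !is_locked && (!anon || ksm);
}

/* Condition after the proposed change, as in the added line of the hunk. */
bool trylock_proposed(bool is_locked)
{
	return !is_locked;
}

/* True when the proposal adds a trylock that is not attempted today. */
bool newly_trylocked(bool is_locked, bool anon, bool ksm)
{
	return trylock_proposed(is_locked) &&
	       !trylock_today(is_locked, anon, ksm);
}

/* Exhaustive check: the two conditions differ only for non-KSM anon folios. */
void check_only_non_ksm_anon_differs(void)
{
	for (int i = 0; i < 8; i++) {
		bool is_locked = i & 1, anon = i & 2, ksm = i & 4;

		assert(newly_trylocked(is_locked, anon, ksm) ==
		       (!is_locked && anon && !ksm));
	}
}
```

In other words, every not-yet-locked, non-KSM anon folio passed to
folio_referenced() would gain a folio_trylock() attempt under the proposal.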

>
> > - You want to unconditionally folio lock all anon and KSM folios for at
> >   least folio_referenced().
> >
> Actually file and KSM folios are always locked today. The anon folios
> are conditionally left out. So my proposal actually standardizes this
> locking, which is an overall simplification.

Right, yes, sorry, I misspoke; I meant to say anon.

This is a HUGE exception though, because that covers the majority of a
process's memory allocation.

>
> > In order to resolve a scalability issue specific to a uffd usecase?
> >
> With the requirement of locking anon_vma in write mode, uffd MOVE
> currently is unusable in practice due to poor scalability. The above
> change in mm/rmap.c allows us to make the following improvement to
> MOVE ioctl:
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 45e6290e2e8b..c4fc87d73ab7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1192,7 +1192,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>         pmd_t dummy_pmdval;
>         pmd_t dst_pmdval;
>         struct folio *src_folio = NULL;
> -       struct anon_vma *src_anon_vma = NULL;
>         struct mmu_notifier_range range;
>         int err = 0;
>
> @@ -1353,28 +1352,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                         goto retry;
>                 }
>
> -               if (!src_anon_vma) {
> -                       /*
> -                        * folio_referenced walks the anon_vma chain
> -                        * without the folio lock. Serialize against it with
> -                        * the anon_vma lock, the folio lock is not enough.
> -                        */
> -                       src_anon_vma = folio_get_anon_vma(src_folio);
> -                       if (!src_anon_vma) {
> -                               /* page was unmapped from under us */
> -                               err = -EAGAIN;
> -                               goto out;
> -                       }
> -                       if (!anon_vma_trylock_write(src_anon_vma)) {
> -                               pte_unmap(src_pte);
> -                               pte_unmap(dst_pte);
> -                               src_pte = dst_pte = NULL;
> -                               /* now we can block and wait */
> -                               anon_vma_lock_write(src_anon_vma);
> -                               goto retry;
> -                       }
> -               }
> -
>                 err = move_present_pte(mm,  dst_vma, src_vma,
>                                        dst_addr, src_addr, dst_pte, src_pte,
>                                        orig_dst_pte, orig_src_pte, dst_pmd,
> @@ -1445,10 +1422,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>         }
>
>  out:
> -       if (src_anon_vma) {
> -               anon_vma_unlock_write(src_anon_vma);
> -               put_anon_vma(src_anon_vma);
> -       }
>         if (src_folio) {
>                 folio_unlock(src_folio);
>                 folio_put(src_folio);

Right, but again it's a niche use case, sorry. Changing how _the whole
system_ does rmap to suit a very specific use case isn't really a viable
approach.

>
>
> > Is this the case? Happy to be corrected if I've misinterpreted.
> >
> > I don't see how this could possibly work, unless I'm missing something
> > here, because:
> >
> > 1. When we lock anon_vma's it's at the root which covers all anon_vma's
> >    covering parent/children of forked processes.
> >
> > 2. We do "top down" operations that acquire the rmap lock on the assumption
> >    we have exclusive access to the rmapping that have nothing to do with
> >    the folio nor could we even know what the folio is at this point.
> >
> > 3. We manipulate higher level page tables on the basis that the rmap lock
> >    excludes other page table walkers.
> >
> > So this proposal seems to violate all of that?
> >
> > For instance, in many VMA operations we perform:
> >
> > anon_vma_interval_tree_pre_update_vma()
> >
> > and
> >
> > anon_vma_interval_tree_post_update_vma()
> >
> > Which removes _all_ R/B tree mappings.
> >
> > So you can now race with this (it of course doesn't care about folio lock)
> > and then get completely incorrect results?
> >
> > This seems fairly disastrous?
> >
> > In free_pgtables() also we call unlink_anon_vmas() which iterates through
> > the vma->anon_vma_chain and uses the anon lock to tear down higher order
> > page tables which you now might race with and that seems even more
> > disastrous...
> >
> >
> > >
> > > Among all rmap_walk() callers that don’t lock anon folios,
> > > folio_referenced() is the most critical (others are
> > > page_idle_clear_pte_refs(), damon_folio_young(), and
> > > damon_folio_mkold()). The relevant code in folio_referenced() is:
> > >
> > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > >         we_locked = folio_trylock(folio);
> > >         if (!we_locked)
> > >                 return 1;
> > > }
> > >
> > > It’s unclear why locking anon_vma exclusively (when updating
> > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap
> > > with folio locked. It’s in the reclaim path, so should not be a
> > > critical path that necessitates some special treatment, unless I’m
> > > missing something.
> > > Therefore, I propose simplifying the locking mechanism by ensuring the
> > > folio is locked before calling rmap_walk(). This helps avoid locking
> > > anon_vma when updating folio->mapping, which, for instance, will help
> > > eliminate the uninterruptible sleep observed in the field traces
> > > mentioned earlier. Furthermore, it enables us to simplify the code in
> > > folio_lock_anon_vma_read() by removing the re-check to ensure that the
> > > field hasn’t changed under us.
> >
> >
> > I mean this is why I get confused here though, because you seem to go
> > from saying 'don't take rmap lock at all' to referencing
> > folio_lock_anon_vma_read()?
> >
> > Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but
> > then I don't see how you impact contention on the anon_vma lock by making
> > this change?
> >
> > I think in general - let's clarify what exactly you intend to do here, and
> > then we need to delineate what we need to confirm and test to have any
> > confidence in making such a change.
> >
> > anon_vma locks (and rmap locks) are very very sensitive in general and
> > we've had actual security issues come up due to race windows emerging from
> > inappropriate handling, not to mention that performance around this
> > obviously matters a great deal.
>
> I couldn't agree more. My changes seemed to simplify, otherwise I
> wouldn't have suggested this. And David's reply yesterday gives
> confidence that it wouldn't negatively affect performance either.

This isn't a simplification though, this is taking a new lock in a core mm
code path _for everyone_ for a specific UFFD use case. Everyone _but_ people
using this UFFD stuff will just pay an overhead.

Another aspect is you're now making it much more likely that taking the
lock will fail, since it's a trylock...
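
A rough userspace model of that failure mode, assuming a C11 atomic_flag as a
stand-in for the folio lock. The names folio_sketch, folio_trylock_sketch and
folio_referenced_sketch are invented for this sketch; only the "trylock, and
return 1 on failure" shape comes from the quoted folio_referenced() code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct folio_sketch {
	atomic_flag locked;	/* stands in for the folio lock */
	bool referenced;	/* stands in for what the rmap walk would find */
};

bool folio_trylock_sketch(struct folio_sketch *f)
{
	/* test_and_set returns the previous value: false means we got it. */
	return !atomic_flag_test_and_set(&f->locked);
}

void folio_unlock_sketch(struct folio_sketch *f)
{
	atomic_flag_clear(&f->locked);
}

/* Same shape as the proposed folio_referenced(): trylock, or bail with 1. */
int folio_referenced_sketch(struct folio_sketch *f, bool is_locked)
{
	bool we_locked = false;
	int ret;

	assert(f != NULL);
	if (!is_locked) {
		we_locked = folio_trylock_sketch(f);
		if (!we_locked)
			return 1;	/* lock held elsewhere: no walk happens */
	}
	ret = f->referenced ? 1 : 0;	/* placeholder for the real rmap walk */
	if (we_locked)
		folio_unlock_sketch(f);
	return ret;
}
```

In this model, a folio that is not referenced at all still yields 1 whenever
something else holds its lock, which is the outcome that becomes more common
once non-KSM anon folios are trylocked too.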

It really needs very deep analysis and justification for me to be in any way
convinced this is ok.

It's hard to justify, and I think any workable solution would need to know
that this case applies.

A very simple but horrible thing could be to have a config flag to
enable/disable this. An even more really truly horrible thing would be a
prctl()... but let's not.

>
> Thanks,
> Lokesh
> >
> > So we must tread carefully here.
> >
> > Thanks, Lorenzo

I remain very very skeptical of this as-is. Of course always happy to be
convinced otherwise...