Message-ID: <CA+EESO5zUO8x21u1KAG5U3Rghaxw8GFGZMhsbM3E2AyeHFYRMw@mail.gmail.com>
Date: Tue, 26 Aug 2025 15:23:28 -0700
From: Lokesh Gidra <lokeshgidra@...gle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: David Hildenbrand <david@...hat.com>, Andrew Morton <akpm@...ux-foundation.org>,
Harry Yoo <harry.yoo@...cle.com>, Zi Yan <ziy@...dia.com>, Barry Song <21cnbao@...il.com>,
"open list:MEMORY MANAGEMENT" <linux-mm@...ck.org>, Peter Xu <peterx@...hat.com>,
Suren Baghdasaryan <surenb@...gle.com>, Kalesh Singh <kaleshsingh@...gle.com>,
android-mm <android-mm@...gle.com>, linux-kernel <linux-kernel@...r.kernel.org>,
Jann Horn <jannh@...gle.com>, Rik van Riel <riel@...riel.com>, Vlastimil Babka <vbabka@...e.cz>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>
Subject: Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()
On Tue, Aug 26, 2025 at 8:52 AM Lorenzo Stoakes
<lorenzo.stoakes@...cle.com> wrote:
>
> On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote:
> > Hi all,
> >
> > Currently, some callers of rmap_walk() conditionally avoid try-locking
> > non-ksm anon folios. This necessitates serialization through anon_vma
> > write-lock elsewhere when folio->mapping and/or folio->index (fields
> > involved in rmap_walk()) are to be updated. This hurts scalability due
> > to the coarse granularity of the lock. For instance, when multiple threads
> > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages
> > from the same src VMA, they all contend for the corresponding
> > anon_vma’s lock. Field traces for arm64 Android devices reveal over
> > 30ms of uninterruptible sleep in the main UI thread, leading to janky
> > user interactions.
>
> Can we clarify whether this is simply an example, or rather the entire
> motivating reason for raising this issue?
>
When I started off, I thought there might be other cases too, but as
of now it looks like only uffd MOVE updates folio->mapping to a
different root anon_vma.
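For reference, that folio->mapping update goes through
folio_move_anon_rmap(), which already requires the folio lock on the
writer side. Roughly (paraphrased from mm/rmap.c, so the exact code
may differ a bit):

void folio_move_anon_rmap(struct folio *folio, struct vm_area_struct *vma)
{
	void *anon_vma = vma->anon_vma;

	/* The writer must already hold the folio lock. */
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	VM_BUG_ON_VMA(!anon_vma, vma);

	anon_vma += FOLIO_MAPPING_ANON;
	/*
	 * anon_vma and the FOLIO_MAPPING_ANON bit are written together
	 * so a reader sees both changes at once.
	 */
	WRITE_ONCE(folio->mapping, anon_vma);
}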
> It's important, because it strikes me that this is a very specific use
> case, and you're now suggesting changing core locking to suit it.
>
> While this is a discussion, and I'm glad you raised it, I think it's
> important in these cases to really exhaustively examine all of the possible
> consequences.
>
> OK so to clarify:
>
> - You want to traverse the rmap entirely without any rmap locks whatsoever
> for anon, relying solely on the folio lock to serialise, because
> otherwise rmap read locks here block other rmap write lock calls.
>
There is a misunderstanding. I'm suggesting locking *both* the folio
and the anon_vma during the rmap walk. To avoid any confusion, here
are the simplifications I suggest in mm/rmap.c:
diff --git a/mm/rmap.c b/mm/rmap.c
index 568198e9efc2..81c177b0cddf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -547,7 +547,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
struct anon_vma *root_anon_vma;
unsigned long anon_mapping;
-retry:
rcu_read_lock();
anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
@@ -558,17 +557,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
root_anon_vma = READ_ONCE(anon_vma->root);
if (down_read_trylock(&root_anon_vma->rwsem)) {
- /*
- * folio_move_anon_rmap() might have changed the anon_vma as we
- * might not hold the folio lock here.
- */
- if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
- anon_mapping)) {
- up_read(&root_anon_vma->rwsem);
- rcu_read_unlock();
- goto retry;
- }
-
/*
* If the folio is still mapped, then this anon_vma is still
* its anon_vma, and holding the mutex ensures that it will
@@ -603,18 +591,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
rcu_read_unlock();
anon_vma_lock_read(anon_vma);
- /*
- * folio_move_anon_rmap() might have changed the anon_vma as we might
- * not hold the folio lock here.
- */
- if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
- anon_mapping)) {
- anon_vma_unlock_read(anon_vma);
- put_anon_vma(anon_vma);
- anon_vma = NULL;
- goto retry;
- }
-
if (atomic_dec_and_test(&anon_vma->refcount)) {
/*
* Oops, we held the last refcount, release the lock
@@ -1006,7 +982,7 @@ int folio_referenced(struct folio *folio, int is_locked,
if (!folio_raw_mapping(folio))
return 0;
- if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
+ if (!is_locked) {
we_locked = folio_trylock(folio);
if (!we_locked)
return 1;
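If we go this route, rmap_walk_anon() could also document the new
contract the same way rmap_walk_file() already does. A hypothetical
sketch, not part of the diff above:

static void rmap_walk_anon(struct folio *folio,
		struct rmap_walk_control *rwc, bool locked)
{
	...
	/*
	 * Hypothetical assertion: with the proposal every caller holds
	 * the folio lock, mirroring rmap_walk_file().
	 */
	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
	...
}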
> - You want to unconditionally folio lock all anon and KSM folios for at
> least folio_referenced().
>
Actually, file and KSM folios are always locked today; only anon
folios are conditionally left out. So my proposal standardizes this
locking, which is an overall simplification.
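For comparison, the file side already spells out why the folio lock is
required there (paraphrased from rmap_walk_file() in mm/rmap.c):

	/*
	 * The folio lock not only makes sure that folio->mapping cannot
	 * suddenly be NULLified by truncation, it makes sure that the
	 * structure at mapping cannot be freed and reused yet, so we can
	 * safely take mapping->i_mmap_rwsem.
	 */
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);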
> In order to resolve a scalability issue specific to a uffd usecase?
>
With the requirement of locking the anon_vma in write mode, uffd MOVE
is currently unusable in practice due to poor scalability. The above
change in mm/rmap.c allows us to make the following improvement to the
MOVE ioctl:
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 45e6290e2e8b..c4fc87d73ab7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1192,7 +1192,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
pmd_t dummy_pmdval;
pmd_t dst_pmdval;
struct folio *src_folio = NULL;
- struct anon_vma *src_anon_vma = NULL;
struct mmu_notifier_range range;
int err = 0;
@@ -1353,28 +1352,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
goto retry;
}
- if (!src_anon_vma) {
- /*
- * folio_referenced walks the anon_vma chain
- * without the folio lock. Serialize against it with
- * the anon_vma lock, the folio lock is not enough.
- */
- src_anon_vma = folio_get_anon_vma(src_folio);
- if (!src_anon_vma) {
- /* page was unmapped from under us */
- err = -EAGAIN;
- goto out;
- }
- if (!anon_vma_trylock_write(src_anon_vma)) {
- pte_unmap(src_pte);
- pte_unmap(dst_pte);
- src_pte = dst_pte = NULL;
- /* now we can block and wait */
- anon_vma_lock_write(src_anon_vma);
- goto retry;
- }
- }
-
err = move_present_pte(mm, dst_vma, src_vma,
dst_addr, src_addr, dst_pte, src_pte,
orig_dst_pte, orig_src_pte, dst_pmd,
@@ -1445,10 +1422,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
}
out:
- if (src_anon_vma) {
- anon_vma_unlock_write(src_anon_vma);
- put_anon_vma(src_anon_vma);
- }
if (src_folio) {
folio_unlock(src_folio);
folio_put(src_folio);
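With the anon_vma write lock gone, the folio lock that move_pages_pte()
already takes on src_folio becomes the only serialization point for the
folio->mapping/index update, which move_present_pte() does roughly as
follows (paraphrased, exact code may differ):

	/*
	 * src_folio is locked here, so rmap walkers that hold the folio
	 * lock cannot observe a half-updated mapping/index pair.
	 */
	folio_move_anon_rmap(src_folio, dst_vma);
	src_folio->index = linear_page_index(dst_vma, dst_addr);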
> Is this the case? Happy to be corrected if I've misinterpreted.
>
> I don't see how this could possibly work, unless I'm missing something
> here, because:
>
> 1. When we lock anon_vma's it's at the root which covers all anon_vma's
> covering parent/children of forked processes.
>
> 2. We do "top down" operations that acquire the rmap lock on the assumption
> that we have exclusive access to the rmapping; these have nothing to do
> with the folio, nor could we even know what the folio is at this point.
>
> 3. We manipulate higher level page tables on the basis that the rmap lock
> excludes other page table walkers.
>
> So this proposal seems to violate all of that?
>
> For instance, in many VMA operations we perform:
>
> anon_vma_interval_tree_pre_update_vma()
>
> and
>
> anon_vma_interval_tree_post_update_vma()
>
> Which removes _all_ R/B tree mappings.
>
> So you can now race with this (it of course doesn't care about folio lock)
> and then get completely incorrect results?
>
> This seems fairly disastrous?
>
> In free_pgtables() also we call unlink_anon_vmas() which iterates through
> the vma->anon_vma_chain and uses the anon lock to tear down higher order
> page tables which you now might race with and that seems even more
> disastrous...
>
>
> >
> > Among all rmap_walk() callers that don’t lock anon folios,
> > folio_referenced() is the most critical (others are
> > page_idle_clear_pte_refs(), damon_folio_young(), and
> > damon_folio_mkold()). The relevant code in folio_referenced() is:
> >
> > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > we_locked = folio_trylock(folio);
> > if (!we_locked)
> > return 1;
> > }
> >
> > It’s unclear why locking anon_vma exclusively (when updating
> > folio->mapping, like in uffd MOVE) is beneficial over walking rmap
> > with folio locked. It’s in the reclaim path, so should not be a
> > critical path that necessitates some special treatment, unless I’m
> > missing something.
> > Therefore, I propose simplifying the locking mechanism by ensuring the
> > folio is locked before calling rmap_walk(). This helps avoid locking
> > anon_vma when updating folio->mapping, which, for instance, will help
> > eliminate the uninterruptible sleep observed in the field traces
> > mentioned earlier. Furthermore, it enables us to simplify the code in
> > folio_lock_anon_vma_read() by removing the re-check to ensure that the
> > field hasn’t changed under us.
>
>
> I mean this is why I get confused here though, because you seem to be
> saying 'don't take the rmap lock at all' while referencing
> folio_lock_anon_vma_read()?
>
> Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but
> then I don't see how you impact contention on the anon_vma lock by making
> this change?
>
> I think in general - let's clarify what exactly you intend to do here, and
> then we need to delineate what we need to confirm and test to have any
> confidence in making such a change.
>
> anon_vma locks (and rmap locks) are very very sensitive in general and
> we've had actual security issues come up due to race windows emerging from
> inappropriate handling, not to mention that performance around this
> obviously matters a great deal.
I couldn't agree more. My changes seemed to simplify things, otherwise
I wouldn't have suggested them. And David's reply yesterday gives me
confidence that they won't negatively affect performance either.
Thanks,
Lokesh
>
> So we must tread carefully here.
>
> Thanks, Lorenzo