Message-ID: <d3a9dbb6-b1f3-43f0-89ae-85c0aef16bdf@lucifer.local>
Date: Thu, 28 Aug 2025 12:31:34 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Lokesh Gidra <lokeshgidra@...gle.com>
Cc: David Hildenbrand <david@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Harry Yoo <harry.yoo@...cle.com>, Zi Yan <ziy@...dia.com>,
Barry Song <21cnbao@...il.com>,
"open list:MEMORY MANAGEMENT" <linux-mm@...ck.org>,
Peter Xu <peterx@...hat.com>, Suren Baghdasaryan <surenb@...gle.com>,
Kalesh Singh <kaleshsingh@...gle.com>,
android-mm <android-mm@...gle.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
Jann Horn <jannh@...gle.com>, Rik van Riel <riel@...riel.com>,
Vlastimil Babka <vbabka@...e.cz>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>
Subject: Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()

On Tue, Aug 26, 2025 at 03:23:28PM -0700, Lokesh Gidra wrote:
> On Tue, Aug 26, 2025 at 8:52 AM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> >
> > On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote:
> > > Hi all,
> > >
> > > Currently, some callers of rmap_walk() conditionally avoid try-locking
> > > non-ksm anon folios. This necessitates serialization through anon_vma
> > > write-lock elsewhere when folio->mapping and/or folio->index (fields
> > > involved in rmap_walk()) are to be updated. This hurts scalability due
> > > to coarse granularity of the lock. For instance, when multiple threads
> > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages
> > > from the same src VMA, they all contend for the corresponding
> > > anon_vma’s lock. Field traces for arm64 android devices reveal over
Hmm, I started by responding below but now have a vague thought of:
What if we find a way to somehow detect this scenario, and mark the
anon_vma in some way to indicate that a folio lock should be tried?
That'd be a lot less egregious than changing things fundamentally for
everyone.
> > > 30ms of uninterruptible sleep in the main UI thread, leading to janky
> > > user interactions.
> >
> > Can we clarify whether this is simply an example, or rather the entire
> > motivating reason for raising this issue?
> >
> When I started off I thought maybe there are other cases too, but it
> looks like as of now only uffd MOVE updates folio->mapping to a
> different root anon_vma.
Yup, I mean I looked into making mremap() do it, but it was insanely
difficult to make it work (sadly!) But indeed.
I think it's important to highlight that this is the use case.
I wonder if we can't do something specific to uffd then that would be less
potentially problematic for the rest of core.
Because I just don't really see this as upstreamable otherwise.
>
> > It's important, because it strikes me that this is a very specific use
> > case, and you're now suggesting changing core locking to suit it.
> >
> > While this is a discussion, and I'm glad you raised it, I think it's
> > important in these cases to really exhaustively examine all of the possible
> > consequences.
> >
> > OK so to clarify:
> >
> > - You want to traverse the rmap entirely without any rmap locks whatsoever
> > for anon, relying solely on the folio lock to serialise, because
> > otherwise rmap read locks here block other rmap write lock calls.
> >
> There is a misunderstanding. I'm suggesting locking *both* folio as
> well as anon_vma during rmap walk. To avoid any confusion, here are
> the simplifications in mm/rmap.c that I suggest:
OK. Well that's less extreme :)
But then, if we're taking both locks, how does this prevent contention on
the anon_vma lock?
Even so, this is adding a bunch of overhead.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 568198e9efc2..81c177b0cddf 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -547,7 +547,6 @@ struct anon_vma *folio_lock_anon_vma_read(const
> struct folio *folio,
> struct anon_vma *root_anon_vma;
> unsigned long anon_mapping;
>
> -retry:
> rcu_read_lock();
> anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
> if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON)
> @@ -558,17 +557,6 @@ struct anon_vma *folio_lock_anon_vma_read(const
> struct folio *folio,
> anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON);
> root_anon_vma = READ_ONCE(anon_vma->root);
> if (down_read_trylock(&root_anon_vma->rwsem)) {
> - /*
> - * folio_move_anon_rmap() might have changed the anon_vma as we
> - * might not hold the folio lock here.
> - */
> - if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> - anon_mapping)) {
> - up_read(&root_anon_vma->rwsem);
> - rcu_read_unlock();
> - goto retry;
> - }
> -
> /*
> * If the folio is still mapped, then this anon_vma is still
> * its anon_vma, and holding the mutex ensures that it will
> @@ -603,18 +591,6 @@ struct anon_vma *folio_lock_anon_vma_read(const
> struct folio *folio,
> rcu_read_unlock();
> anon_vma_lock_read(anon_vma);
>
> - /*
> - * folio_move_anon_rmap() might have changed the anon_vma as we might
> - * not hold the folio lock here.
> - */
> - if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
> - anon_mapping)) {
> - anon_vma_unlock_read(anon_vma);
> - put_anon_vma(anon_vma);
> - anon_vma = NULL;
> - goto retry;
> - }
> -
> if (atomic_dec_and_test(&anon_vma->refcount)) {
> /*
> * Oops, we held the last refcount, release the lock
> @@ -1006,7 +982,7 @@ int folio_referenced(struct folio *folio, int is_locked,
> if (!folio_raw_mapping(folio))
> return 0;
>
> - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> + if (!is_locked) {
> we_locked = folio_trylock(folio);
> if (!we_locked)
> return 1;
This is still a really big change, we're going to be contending the folio
lock potentially a LOT more, for the sake of a very specific and peculiar
uffd use case.
It's hard to justify. And any such justification would need _really
serious_ testing on very many arches/workloads to even come close to being
ok in my view.
This is a pretty huge ask + it's for a specific use case.
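To make the difference between the two conditions concrete, here is a rough userspace sketch (not kernel code). The struct, the atomic_flag standing in for the folio lock, and every *_model name are made up for illustration; only the two conditions and the "return 1 when the trylock fails" behaviour are taken from the hunk quoted above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct folio; not the kernel definition. */
struct folio_model {
	atomic_flag locked;	/* models the folio lock */
	bool anon;		/* models folio_test_anon() */
	bool ksm;		/* models folio_test_ksm() */
};

/* Returns true if the lock was taken, false if somebody else holds it. */
static bool folio_trylock_model(struct folio_model *f)
{
	return !atomic_flag_test_and_set(&f->locked);
}

static void folio_unlock_model(struct folio_model *f)
{
	atomic_flag_clear(&f->locked);
}

/* Condition before the hunk: non-KSM anon folios are never trylocked. */
static bool wants_trylock_current(const struct folio_model *f, bool is_locked)
{
	return !is_locked && (!f->anon || f->ksm);
}

/* Condition after the hunk: every folio not already locked is trylocked. */
static bool wants_trylock_proposed(const struct folio_model *f, bool is_locked)
{
	(void)f;
	return !is_locked;
}

/*
 * Returns 1 ("assume referenced") when the trylock fails, as in the
 * quoted hunk. Otherwise the model pretends the walk found nothing
 * and returns 0.
 */
static int folio_referenced_model(struct folio_model *f, bool is_locked,
				  bool proposed)
{
	bool we_locked = false;
	bool want;

	assert(f != NULL);
	want = proposed ? wants_trylock_proposed(f, is_locked)
			: wants_trylock_current(f, is_locked);
	if (want) {
		we_locked = folio_trylock_model(f);
		if (!we_locked)
			return 1;
	}

	/* The rmap walk would happen here. */

	if (we_locked)
		folio_unlock_model(f);
	return 0;
}
```

In this model, a non-KSM anon folio whose lock is held elsewhere is still walked under the current condition, whereas under the proposed condition the function bails out with 1 before walking at all. That is the extra trylock-failure exposure I'm worried about.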
>
> > - You want to unconditionally folio lock all anon and KSM folios for at
> > least folio_referenced().
> >
> Actually file and KSM folios are always locked today. The anon folios
> are conditionally left out. So my proposal actually standardizes this
> locking, which is an overall simplification.
Right, yes, sorry, I misspoke; I meant to say anon.
This is a HUGE exception though, because that covers the majority of a
process's memory allocation.
>
> > In order to resolve a scalability issue specific to a uffd usecase?
> >
> With the requirement of locking anon_vma in write mode, uffd MOVE
> currently is unusable in practice due to poor scalability. The above
> change in mm/rmap.c allows us to make the following improvement to
> MOVE ioctl:
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 45e6290e2e8b..c4fc87d73ab7 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1192,7 +1192,6 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
> pmd_t dummy_pmdval;
> pmd_t dst_pmdval;
> struct folio *src_folio = NULL;
> - struct anon_vma *src_anon_vma = NULL;
> struct mmu_notifier_range range;
> int err = 0;
>
> @@ -1353,28 +1352,6 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
> goto retry;
> }
>
> - if (!src_anon_vma) {
> - /*
> - * folio_referenced walks the anon_vma chain
> - * without the folio lock. Serialize against it with
> - * the anon_vma lock, the folio lock is not enough.
> - */
> - src_anon_vma = folio_get_anon_vma(src_folio);
> - if (!src_anon_vma) {
> - /* page was unmapped from under us */
> - err = -EAGAIN;
> - goto out;
> - }
> - if (!anon_vma_trylock_write(src_anon_vma)) {
> - pte_unmap(src_pte);
> - pte_unmap(dst_pte);
> - src_pte = dst_pte = NULL;
> - /* now we can block and wait */
> - anon_vma_lock_write(src_anon_vma);
> - goto retry;
> - }
> - }
> -
> err = move_present_pte(mm, dst_vma, src_vma,
> dst_addr, src_addr, dst_pte, src_pte,
> orig_dst_pte, orig_src_pte, dst_pmd,
> @@ -1445,10 +1422,6 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
> }
>
> out:
> - if (src_anon_vma) {
> - anon_vma_unlock_write(src_anon_vma);
> - put_anon_vma(src_anon_vma);
> - }
> if (src_folio) {
> folio_unlock(src_folio);
> folio_put(src_folio);
Right, but again it's a niche use case, sorry. Changing how _the whole
system_ does rmap to suit a very specific use case isn't really a viable
approach.
>
>
> > Is this the case? Happy to be corrected if I've misinterpreted.
> >
> > I don't see how this could possibly work, unless I'm missing something
> > here, because:
> >
> > > 1. When we lock anon_vmas it's at the root, which covers all anon_vmas
> > > covering parent/children of forked processes.
> >
> > 2. We do "top down" operations that acquire the rmap lock on the assumption
> > we have exclusive access to the rmapping that have nothing to do with
> > the folio nor could we even know what the folio is at this point.
> >
> > 3. We manipulate higher level page tables on the basis that the rmap lock
> > excludes other page table walkers.
> >
> > So this proposal seems to violate all of that?
> >
> > For instance, in many VMA operations we perform:
> >
> > anon_vma_interval_tree_pre_update_vma()
> >
> > and
> >
> > anon_vma_interval_tree_post_update_vma()
> >
> > Which removes _all_ R/B tree mappings.
> >
> > So you can now race with this (it of course doesn't care about folio lock)
> > and then get completely incorrect results?
> >
> > > This seems fairly disastrous?
> >
> > In free_pgtables() also we call unlink_anon_vmas() which iterates through
> > the vma->anon_vma_chain and uses the anon lock to tear down higher order
> > page tables which you now might race with and that seems even more
> > > disastrous...
> >
> >
> > >
> > > Among all rmap_walk() callers that don’t lock anon folios,
> > > folio_referenced() is the most critical (others are
> > > page_idle_clear_pte_refs(), damon_folio_young(), and
> > > damon_folio_mkold()). The relevant code in folio_referenced() is:
> > >
> > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > > we_locked = folio_trylock(folio);
> > > if (!we_locked)
> > > return 1;
> > > }
> > >
> > > It’s unclear why locking anon_vma exclusively (when updating
> > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap
> > > with folio locked. It’s in the reclaim path, so should not be a
> > > critical path that necessitates some special treatment, unless I’m
> > > missing something.
> > > Therefore, I propose simplifying the locking mechanism by ensuring the
> > > folio is locked before calling rmap_walk(). This helps avoid locking
> > > anon_vma when updating folio->mapping, which, for instance, will help
> > > eliminate the uninterruptible sleep observed in the field traces
> > > mentioned earlier. Furthermore, it enables us to simplify the code in
> > > folio_lock_anon_vma_read() by removing the re-check to ensure that the
> > > field hasn’t changed under us.
> >
> >
> > > I mean this is why I get confused here though, because you seem to be
> > > saying 'don't take rmap lock at all' while also referencing
> > > folio_lock_anon_vma_read()?
> >
> > Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but
> > then I don't see how you impact contention on the anon_vma lock by making
> > this change?
> >
> > I think in general - let's clarify what exactly you intend to do here, and
> > then we need to delineate what we need to confirm and test to have any
> > confidence in making such a change.
> >
> > anon_vma locks (and rmap locks) are very very sensitive in general and
> > we've had actual security issues come up due to race windows emerging from
> > inappropriate handling, not to mention that performance around this
> > obviously matters a great deal.
>
> I couldn't agree more. My changes seemed to simplify, otherwise I
> wouldn't have suggested this. And David's reply yesterday gives
> confidence that it wouldn't negatively affect performance either.
This isn't a simplification though, this is taking a new lock in a core mm
code path _for everyone_ for a specific UFFD use case. Everyone _but_ people
using this UFFD stuff will just pay an overhead.
Another aspect is you're now making it much more likely that taking the
lock will fail, since it's a trylock...
It really needs very deep analysis and justification for me to be in any
way convinced this is ok.
It's hard to justify, and I think any workable solution would need some way
of detecting that this case applies.
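Purely as a userspace illustration of the vague 'mark the anon_vma' thought from earlier in this mail (nothing below exists in the kernel; the flag, the structs and all *_model names are hypothetical), the idea is that only walkers of a marked anon_vma pay for the folio trylock:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: struct anon_vma has no such flag today. */
struct anon_vma_model {
	atomic_bool folio_lock_hint;	/* set when folios may be moved away */
};

/* Hypothetical stand-in for struct folio. */
struct hinted_folio_model {
	atomic_flag locked;		/* models the folio lock */
	struct anon_vma_model *anon_vma;
};

/* Something like uffd MOVE would mark the anon_vma before moving folios. */
static void anon_vma_mark_model(struct anon_vma_model *av)
{
	atomic_store(&av->folio_lock_hint, true);
}

/*
 * Returns 1 if the walk was skipped because the folio trylock failed,
 * 0 if the (modelled) walk ran. Unmarked anon_vmas never touch the
 * folio lock, so everybody else keeps today's behaviour.
 */
static int rmap_walk_model(struct hinted_folio_model *f)
{
	bool hinted;

	assert(f != NULL && f->anon_vma != NULL);
	hinted = atomic_load(&f->anon_vma->folio_lock_hint);
	if (hinted && atomic_flag_test_and_set(&f->locked))
		return 1;

	/* The rmap walk would happen here. */

	if (hinted)
		atomic_flag_clear(&f->locked);
	return 0;
}
```

This deliberately ignores the hard part: a walker that sampled the flag before it was set is still mid-walk without the folio lock, so a real design would need to say how the marker waits for or excludes such walkers.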
A very simple but horrible thing could be to have a config flag to
enable/disable this. An even more really truly horrible thing would be a
prctl()... but let's not.
>
> Thanks,
> Lokesh
> >
> > So we must tread carefully here.
> >
> > Thanks, Lorenzo
I remain very very skeptical of this as-is. Of course always happy to be
convinced otherwise...