Message-ID: <65dd5d54-87ab-49a8-8734-2201a0014feb@lucifer.local>
Date: Tue, 26 Aug 2025 16:52:20 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Lokesh Gidra <lokeshgidra@...gle.com>
Cc: David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Harry Yoo <harry.yoo@...cle.com>, Zi Yan <ziy@...dia.com>,
        Barry Song <21cnbao@...il.com>,
        "open list:MEMORY MANAGEMENT" <linux-mm@...ck.org>,
        Peter Xu <peterx@...hat.com>, Suren Baghdasaryan <surenb@...gle.com>,
        Kalesh Singh <kaleshsingh@...gle.com>,
        android-mm <android-mm@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Jann Horn <jannh@...gle.com>, Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>
Subject: Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()

On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote:
> Hi all,
>
> Currently, some callers of rmap_walk() conditionally avoid try-locking
> non-ksm anon folios. This necessitates serialization through anon_vma
> write-lock elsewhere when folio->mapping and/or folio->index (fields
> involved in rmap_walk()) are to be updated. This hurts scalability due
> to coarse granularity of the lock. For instance, when multiple threads
> invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages
> from the same src VMA, they all contend for the corresponding
> anon_vma’s lock. Field traces from arm64 Android devices reveal over
> 30ms of uninterruptible sleep in the main UI thread, leading to janky
> user interactions.
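
For reference, my understanding of the scenario described above, from
the userspace side, is roughly the following (a sketch only - thread
creation, uffd registration and error handling are elided, and the
function and parameter names are illustrative, not from the report):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Called concurrently from N threads, each with a distinct src/dst. */
static void move_one_page(int uffd, unsigned long dst, unsigned long src,
			  unsigned long page_size)
{
	struct uffdio_move move = {
		.dst = dst,
		.src = src,	/* distinct pages, but same src VMA */
		.len = page_size,
		.mode = 0,
	};

	/*
	 * Each of these calls currently serialises on the same anon_vma
	 * write lock in the kernel when updating folio->mapping.
	 */
	if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
		/* error handling elided */
	}
}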

Can we clarify whether this is simply an example, or rather the entire
motivating reason for raising this issue?

It's important, because it strikes me that this is a very specific use
case, and you're now suggesting changing core locking to suit it.

While this is a discussion, and I'm glad you raised it, I think it's
important in these cases to really exhaustively examine all of the possible
consequences.

OK so to clarify:

- You want to traverse the rmap entirely without any rmap locks whatsoever
  for anon, relying solely on the folio lock to serialise, because
  otherwise rmap read locks here block other rmap write lock calls.

- You want to unconditionally folio lock all anon and KSM folios for at
  least folio_referenced().

In order to resolve a scalability issue specific to a uffd use case?

Is this the case? Happy to be corrected if I've misinterpreted.

I don't see how this could possibly work, unless I'm missing something
here, because:

1. When we lock an anon_vma, we take the lock at the root, which covers all
   anon_vmas of the parent/children of forked processes (the locking helpers
   sketched after this list make this explicit).

2. We perform "top down" operations that acquire the rmap lock on the
   assumption that we have exclusive access to the rmap; these have nothing
   to do with the folio, nor could we even know what the folio is at that
   point.

3. We manipulate higher level page tables on the basis that the rmap lock
   excludes other page table walkers.
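
On point 1, the locking helpers make the root behaviour explicit -
paraphrasing include/linux/rmap.h (lockdep annotations elided):

/*
 * The rwsem lives in the *root* anon_vma, so write-locking any
 * anon_vma in a fork hierarchy excludes rmap walkers across the
 * entire parent/child tree, not just the one anon_vma.
 */
static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
{
	down_write(&anon_vma->root->rwsem);
}

static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
{
	down_read(&anon_vma->root->rwsem);
}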

So this proposal seems to violate all of that?

For instance, in many VMA operations we perform:

anon_vma_interval_tree_pre_update_vma()

and

anon_vma_interval_tree_post_update_vma()

Which removes _all_ R/B tree mappings.
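
The pattern is roughly (abridged from the kernel source; the _post_
variant re-inserts the same entries under the same write lock):

/*
 * Callers hold anon_vma_lock_write() around the pre/post pair;
 * nothing here takes or respects any folio lock.
 */
static void
anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma)
{
	struct anon_vma_chain *avc;

	list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
		anon_vma_interval_tree_remove(avc, &avc->anon_vma->rb_root);
}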

So you can now race with this (which of course doesn't care about the folio
lock) and get completely incorrect results?

This seems fairly disastrous?

In free_pgtables() we also call unlink_anon_vmas(), which iterates through
the vma->anon_vma_chain and uses the anon_vma lock while higher order page
tables are torn down; you might now race with that too, which seems even
more disastrous...
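
The relevant shape there being something like (heavily abridged from
free_pgtables() in mm/memory.c, with iteration details elided):

	/* free_pgtables(): rmap unlink paired with page table teardown */
	while (vma) {
		unlink_anon_vmas(vma);	/* write-locks the anon_vma root */
		unlink_file_vma(vma);
		free_pgd_range(tlb, addr, vma->vm_end, floor, ceiling);
		vma = next;		/* details elided */
	}

None of which takes or checks any folio lock.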


>
> Among all rmap_walk() callers that don’t lock anon folios,
> folio_referenced() is the most critical (others are
> page_idle_clear_pte_refs(), damon_folio_young(), and
> damon_folio_mkold()). The relevant code in folio_referenced() is:
>
> if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
>         we_locked = folio_trylock(folio);
>         if (!we_locked)
>                 return 1;
> }
>
> It’s unclear why locking anon_vma exclusively (when updating
> folio->mapping, like in uffd MOVE) is beneficial over walking rmap
> with folio locked. It’s in the reclaim path, so should not be a
> critical path that necessitates some special treatment, unless I’m
> missing something.
> Therefore, I propose simplifying the locking mechanism by ensuring the
> folio is locked before calling rmap_walk(). This helps avoid locking
> anon_vma when updating folio->mapping, which, for instance, will help
> eliminate the uninterruptible sleep observed in the field traces
> mentioned earlier. Furthermore, it enables us to simplify the code in
> folio_lock_anon_vma_read() by removing the re-check to ensure that the
> field hasn’t changed under us.
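
For reference, the re-check being referred to looks roughly like this
(abridged from folio_lock_anon_vma_read() in mm/rmap.c):

	anon_vma_lock_read(anon_vma);
	/*
	 * folio_move_anon_rmap() might have changed folio->mapping
	 * while we slept on the rwsem, as we don't hold the folio
	 * lock here - if so, bail out and retry.
	 */
	if (unlikely((unsigned long)READ_ONCE(folio->mapping) !=
		     anon_mapping)) {
		anon_vma_unlock_read(anon_vma);
		put_anon_vma(anon_vma);
		anon_vma = NULL;
		goto retry;
	}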


I mean, this is why I get confused here, because you seem to go from saying
'don't take the rmap lock at all' to referencing
folio_lock_anon_vma_read()?

Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but
then I don't see how making this change reduces contention on the anon_vma
lock?

I think in general - let's clarify what exactly you intend to do here, and
then we need to delineate what we need to confirm and test to have any
confidence in making such a change.

anon_vma locks (and rmap locks) are very very sensitive in general and
we've had actual security issues come up due to race windows emerging from
inappropriate handling, not to mention that performance around this
obviously matters a great deal.

So we must tread carefully here.

Thanks, Lorenzo
