linux-kernel - Re: [RFC] Unconditionally lock folios when calling rmap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aKhIL3OguViS9myH@hyeyoo>
Date: Fri, 22 Aug 2025 19:36:31 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Lokesh Gidra <lokeshgidra@...gle.com>
Cc: Zi Yan <ziy@...dia.com>, Barry Song <21cnbao@...il.com>,
        "open list:MEMORY MANAGEMENT" <linux-mm@...ck.org>,
        Peter Xu <peterx@...hat.com>, David Hildenbrand <david@...hat.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Kalesh Singh <kaleshsingh@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        android-mm <android-mm@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Jann Horn <jannh@...gle.com>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        Rik van Riel <riel@...riel.com>, Vlastimil Babka <vbabka@...e.cz>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>
Subject: Re: [RFC] Unconditionally lock folios when calling rmap_walk()

On Thu, Aug 21, 2025 at 10:56:02AM -0700, Lokesh Gidra wrote:
> On Thu, Aug 21, 2025 at 9:14 AM Zi Yan <ziy@...dia.com> wrote:
> >
> > On 21 Aug 2025, at 8:01, Barry Song wrote:
> >
> > > On Thu, Aug 21, 2025 at 12:29 PM Lokesh Gidra <lokeshgidra@...gle.com> wrote:
> > >>
> > >> Adding linux-mm mailing list. Mistakenly used the wrong email address.

[Cc'ing MEMORY MANAGEMENT - RMAP folks]

> > >> On Wed, Aug 20, 2025 at 9:23 PM Lokesh Gidra <lokeshgidra@...gle.com> wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > >>> Currently, some callers of rmap_walk() conditionally avoid try-locking
> > >>> non-ksm anon folios. This necessitates serialization through anon_vma
> > >>> write-lock when folio->mapping and/or folio->index (fields involved in
> > >>> rmap_walk()) are to be updated.

For non-ksm anon folios, rmap needs to take anon_vma lock.

Simply acquiring the folio lock instead of anon_vma lock isn't enough
1) because the kernel can't stablize anon_vma without anon_vma lock
(an anon_vma cannot be freed while someone's holding anon_vma lock,
see anon_vma_free()).

2) without anon_vma lock the kernel can't reliably unmap folios because
they can be mapped to other processes (by fork()) while the kernel is
iterating list of VMAs that can possibly map the folio. fork() doens't
and shouldn't acquire folio lock.

3) Holding anon_vma lock also prevents anon_vma_chains from
   being freed while holding the lock.

[Are there more things to worry about that I missed?
 Please add them if so]

Any idea to relax locking requirements while addressing these
requirements?

If some users don't care about missing some PTE A bits due to race
against fork() (perhaps folio_referenced()?), a crazy idea might be to
RCU-protect anon_vma_chains (but then we need to check if the VMA is
still alive) and use refcount to stablize anon_vmas?

> > >>> This hurts scalability due to coarse
> > >>> granularity of the lock. For instance, when multiple threads invoke
> > >>> userfaultfd’s MOVE ioctl simultaneously to move distinct pages from
> > >>> the same src VMA, they all contend for the corresponding anon_vma’s
> > >>> lock. Field traces for arm64 android devices reveal over 30ms of
> > >>> uninterruptible sleep in the main UI thread, leading to janky user
> > >>> interactions.
> > >>>
> > >>> Among all rmap_walk() callers that don’t lock anon folios,
> > >>> folio_referenced() is the most critical (others are
> > >>> page_idle_clear_pte_refs(), damon_folio_young(), and
> > >>> damon_folio_mkold()). The relevant code in folio_referenced() is:
> > >>>
> > >>> if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > >>>         we_locked = folio_trylock(folio);
> > >>>         if (!we_locked)
> > >>>                 return 1;
> > >>> }
> >
> > This seems to be legacy code from commit 5ad6468801d2 ("ksm: let shared pages be
> > swappable"). From the commit log, the lock is used to protect KSM stable
> > tree from concurrent modification.
> >
> It seems like the conditional locking of file page/folio was added in
> a 2004 commit  edcc56dc6a7c758c ("maplock: kill page_map_lock"). Later
> in the commit you mentioned locking was also added for KSM, and now
> only non-KSM anon folios are left :-)
>
> > >>> It’s unclear why locking anon_vma (when updating folio->mapping) is
> > >>> beneficial over locking the folio here. It’s in the reclaim path, so
> > >>> should not be a critical path that necessitates some special
> > >>> treatment, unless I’m missing something.
> >
> > The decision was made before the first git commit 1da177e4c3f4 based on
> > git history. Maybe it is time to revisit it and improve it.
> >
> >
> > >>> Therefore, I propose simplifying the locking mechanism by
> > >>> unconditionally try-locking the folio in such cases. This helps avoid
> > >>> locking anon_vma when updating folio->mapping, which, for instance,
> > >>> will help eliminate the uninterruptible sleep observed in the field
> > >>> traces mentioned earlier.

As mentioned above, that isn't enough.

-- 
Cheers,
Harry / Hyeonggon

> > >>> Furthermore, it enables us to simplify the
> > >>> code in folio_lock_anon_vma_read() by removing the re-check to ensure
> > >>> that the field hasn’t changed under us.
> > >
> > > Thanks, I’m personally quite interested in this topic and will take a
> > > closer look as well. Beyond this one userfaultfd move, we’ve observed
> > > severe anon_vma lock contention between fork, unmap (process exit), and
> > > memory reclamation. This has caused noticeable UI stutters, especially
> > > when many VMAs share the same anon_vma root.