Message-ID: <aKxAWqvTxctPoumF@hyeyoo>
Date: Mon, 25 Aug 2025 19:52:10 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Lokesh Gidra <lokeshgidra@...gle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>,
        Barry Song <21cnbao@...il.com>,
        "open list:MEMORY MANAGEMENT" <linux-mm@...ck.org>,
        Peter Xu <peterx@...hat.com>, Suren Baghdasaryan <surenb@...gle.com>,
        Kalesh Singh <kaleshsingh@...gle.com>,
        android-mm <android-mm@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Jann Horn <jannh@...gle.com>, Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()

On Sat, Aug 23, 2025 at 11:45:03PM -0700, Lokesh Gidra wrote:
> On Sat, Aug 23, 2025 at 10:31 PM Harry Yoo <harry.yoo@...cle.com> wrote:
> >
> > On Sat, Aug 23, 2025 at 09:18:11PM -0700, Lokesh Gidra wrote:
> > > On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@...gle.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Currently, some callers of rmap_walk() conditionally avoid try-locking
> > > > non-KSM anon folios. This necessitates serialization through the anon_vma
> > > > write lock elsewhere when folio->mapping and/or folio->index (fields
> > > > involved in rmap_walk()) are to be updated. This hurts scalability due to
> > > > the coarse granularity of the lock. For instance, when multiple threads
> > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages
> > > > from the same src VMA, they all contend for the corresponding
> > > > anon_vma’s lock. Field traces from arm64 Android devices reveal over
> > > > 30 ms of uninterruptible sleep in the main UI thread, leading to janky
> > > > user interactions.
> > > >
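For other readers, the serialization in question is roughly what
move_pages_pte() in mm/userfaultfd.c does today. Heavily simplified (error
handling and the pte/folio revalidation omitted), the mover does:

	src_anon_vma = folio_get_anon_vma(src_folio);	/* pin the anon_vma */
	anon_vma_lock_write(src_anon_vma);		/* serialize against rmap_walk() */

	folio_move_anon_rmap(src_folio, dst_vma);	/* rewrites folio->mapping */
	WRITE_ONCE(src_folio->index, linear_page_index(dst_vma, dst_addr));

	anon_vma_unlock_write(src_anon_vma);
	put_anon_vma(src_anon_vma);

Every mover whose src folio hangs off the same root anon_vma takes the same
rwsem for write, so concurrent MOVEs on distinct pages of one VMA still
serialize.
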
> > > > Among all rmap_walk() callers that don’t lock anon folios,
> > > > folio_referenced() is the most critical (others are
> > > > page_idle_clear_pte_refs(), damon_folio_young(), and
> > > > damon_folio_mkold()). The relevant code in folio_referenced() is:
> > > >
> > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > > >         we_locked = folio_trylock(folio);
> > > >         if (!we_locked)
> > > >                 return 1;
> > > > }
> > > >
> > > > It’s unclear why locking the anon_vma exclusively (when updating
> > > > folio->mapping, like in uffd MOVE) is preferable to walking the rmap
> > > > with the folio locked. folio_referenced() is in the reclaim path, so it
> > > > should not be a critical path that necessitates special treatment,
> > > > unless I’m missing something.
> > > >
> > > > Therefore, I propose simplifying the locking mechanism by ensuring the
> > > > folio is locked before calling rmap_walk(). This helps avoid locking
> > > > anon_vma when updating folio->mapping, which, for instance, will help
> > > > eliminate the uninterruptible sleep observed in the field traces
> > > > mentioned earlier. Furthermore, it enables us to simplify the code in
> > > > folio_lock_anon_vma_read() by removing the re-check to ensure that the
> > > > field hasn’t changed under us.
> > > Hi Harry,
> > >
> > > Your comment [1] in the other thread was quite useful and also needed
> > > to be responded to. So bringing it here for continuing discussion.
> >
> > Hi Lokesh,
> >
> > Here I'm quoting my previous comment for discussion. I should have done it
> > earlier but you know, it was Friday night in Korea :)
> 
> No problem at all. :)
> >
> > My previous comment was:
> >   Simply acquiring the folio lock instead of the anon_vma lock isn't enough
> >   1) because the kernel can't stabilize an anon_vma without the anon_vma lock
> >   (an anon_vma cannot be freed while someone is holding the anon_vma lock,
> >   see anon_vma_free()).
> >
> >   2) without the anon_vma lock the kernel can't reliably unmap folios because
> >   they can be mapped into other processes (by fork()) while the kernel is
> >   iterating the list of VMAs that can possibly map the folio. fork() doesn't
> >   and shouldn't acquire the folio lock.
> >
> >   3) Holding anon_vma lock also prevents anon_vma_chains from
> >      being freed while holding the lock.
> >
> >   [Are there more things to worry about that I missed?
> >    Please add them if so]
> >
> >   Any idea to relax locking requirements while addressing these
> >   requirements?
> >
> >   If some users don't care about missing some PTE A bits due to a race
> >   against fork() (perhaps folio_referenced()?), a crazy idea might be to
> >   RCU-protect anon_vma_chains (but then we need to check whether the VMA is
> >   still alive) and use a refcount to stabilize anon_vmas?
> >
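To be concrete about 1): the lifetime rule I'm referring to is the one in
anon_vma_free() in mm/rmap.c, which (roughly, from memory) does:

	VM_BUG_ON(atomic_read(&anon_vma->refcount));

	/*
	 * A racing folio_lock_anon_vma_read() may still be holding the
	 * root rwsem even after the last refcount was dropped; wait for
	 * it to let go before freeing.
	 */
	might_sleep();
	if (rwsem_is_locked(&anon_vma->root->rwsem)) {
		anon_vma_lock_write(anon_vma);
		anon_vma_unlock_write(anon_vma);
	}

	kmem_cache_free(anon_vma_cachep, anon_vma);

So it's the root anon_vma lock (plus the refcount) that keeps the anon_vma,
and per 3) the anon_vma_chains, from disappearing under the walker; the
folio lock gives no such guarantee.
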
> > > It seems from your comment that you misunderstood my proposal. I am
> > > not suggesting replacing anon_vma lock with folio lock during rmap
> > > walk. Clearly, it is essential for all the reasons that you
> > > enumerated. My proposal is to lock anon folios during rmap_walk(),
> > > like file and KSM folios.
> >
> > Still not sure if I follow your proposal. Let's clarify a little bit.
> >
> > As the anon_vma lock is a reader-writer semaphore, maybe you're saying
> > 1) readers should acquire both folio lock and anon_vma lock, and
> >
> > > This helps improve scalability (and also simplifies the code in
> > > folio_lock_anon_vma_read()), as we can then serialize on the folio lock
> > > instead of the anon_vma lock when moving the folio to a different root
> > > anon_vma in folio_move_anon_rmap() [2].
> >
> > 2) some of the existing writers (e.g., move_pages_pte() in mm/userfaultfd.c)
> >    simply update folio->index and folio->mapping, and they should be able
> >    to run in parallel when they're not updating the same folio,
> >    by taking the folio lock and avoiding the anon_vma lock?
> 
> Yes, that's exactly what I am hoping to achieve.

I think technically that should work. I can't think of a way this could
go wrong (Or maybe I just lack imagination :/ ).
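IIUC, the writer side would then look roughly like this (same names as in
today's move_pages_pte(), just with the anon_vma lock dropped; untested
sketch):

	folio_lock(src_folio);		/* rmap_walk() now serializes on this */
	folio_move_anon_rmap(src_folio, dst_vma);
	WRITE_ONCE(src_folio->index, linear_page_index(dst_vma, dst_addr));
	folio_unlock(src_folio);

i.e. two movers only contend when they target the same folio.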

Adding a folio_lock/unlock pair in folio_referenced() is likely to affect
reclaim performance. How bad is it? I don't know; we need data.
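For concreteness, I assume the reader side would simply drop the anon/ksm
special case, i.e. something like (sketch only):

	if (!is_locked) {
		we_locked = folio_trylock(folio);
		if (!we_locked)
			return 1;
	}

so every !is_locked caller now pays a trylock (and a possible early
"referenced" return on contention) for anon folios too. That's the cost
we'd need numbers for.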

And adding cost only for the uffd case might not be justifiable...

> > I see a comment in move_pages_pte():
> > /*
> >  * folio_referenced walks the anon_vma chain
> >  * without the folio lock. Serialize against it with
> >  * the anon_vma lock, the folio lock is not enough.
> >  */
> >
> > > [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo
> > > [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com
> > > > Thanks,
> > > > Lokesh

-- 
Cheers,
Harry / Hyeonggon
