Message-ID: <f0f9564e-7b4f-4799-b335-55b47fa7bbc3@lucifer.local>
Date: Fri, 29 Aug 2025 10:01:12 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Lokesh Gidra <lokeshgidra@...gle.com>
Cc: David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Harry Yoo <harry.yoo@...cle.com>, Zi Yan <ziy@...dia.com>,
        Barry Song <21cnbao@...il.com>,
        "open list:MEMORY MANAGEMENT" <linux-mm@...ck.org>,
        Peter Xu <peterx@...hat.com>, Suren Baghdasaryan <surenb@...gle.com>,
        Kalesh Singh <kaleshsingh@...gle.com>,
        android-mm <android-mm@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Jann Horn <jannh@...gle.com>, Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>
Subject: Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()

On Thu, Aug 28, 2025 at 05:23:56PM -0700, Lokesh Gidra wrote:
> On Thu, Aug 28, 2025 at 5:04 AM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 05:19:05PM +0200, David Hildenbrand wrote:
> > > On 22.08.25 19:29, Lokesh Gidra wrote:
> > > > Hi all,
> > > >
> > > > Currently, some callers of rmap_walk() conditionally avoid try-locking
> > > > non-ksm anon folios. This necessitates serialization through anon_vma
> > > > write-lock elsewhere when folio->mapping and/or folio->index (fields
> > > > involved in rmap_walk()) are to be updated. This hurts scalability due
> > > > to coarse granularity of the lock. For instance, when multiple threads
> > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages
> > > > from the same src VMA, they all contend for the corresponding
> > > > anon_vma’s lock. Field traces for arm64 android devices reveal over
> > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky
> > > > user interactions.
> > > >
> > > > Among all rmap_walk() callers that don’t lock anon folios,
> > > > folio_referenced() is the most critical (others are
> > > > page_idle_clear_pte_refs(), damon_folio_young(), and
> > > > damon_folio_mkold()). The relevant code in folio_referenced() is:
> > > >
> > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > > >          we_locked = folio_trylock(folio);
> > > >          if (!we_locked)
> > > >                  return 1;
> > > > }
> > > >
> > > > It’s unclear why locking anon_vma exclusively (when updating
> > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap
> > > > with folio locked. It’s in the reclaim path, so should not be a
> > > > critical path that necessitates some special treatment, unless I’m
> > > > missing something.
> > > >
> > > > Therefore, I propose simplifying the locking mechanism by ensuring the
> > > > folio is locked before calling rmap_walk().
> > >
> > > Essentially, what you mean is roughly:
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 34333ae3bd80f..0800e73c0796e 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked,
> > >         if (!folio_raw_mapping(folio))
> > >                 return 0;
> > > -       if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > > +       if (!is_locked) {
> > >                 we_locked = folio_trylock(folio);
> > >                 if (!we_locked)
> > >                         return 1;
> > >
> > >
> > > The downside of that change is that ordinary (!ksm) folios will observe being locked
> >
> > Well anon folios, I guess this is what you meant :)
> >
> > > when we are actually only trying to assess if they were referenced.
> > >
> > > Does it matter?
> >
> > Also another downside is that we trylock and abort if we fail, so we've made
> > this conditional on that.
> >
> > But surely this is going to impact reclaim performance esp. under heavy memory
> > pressure? It is at least a trylock.
> >
> > It's dangerous waters, and I'd really want some detailed data + analysis to
> > prove the point here, I don't think theorising about it is enough.
> >
> > >
> > > I can only speculate that it might have been very relevant before
> > > 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive").
> > >
> > > Essentially any R/O fault would have resulted in us copying the page, simply because
> > > there is concurrent folio_referenced() happening.
> >
> > Fun.
> >
> > Thing is we now have to consider _every case_ where a contention might cause an
> > issue.
> >
> > One thing I _was_ concerned about was:
> >
> > - uffd move locks folios
> > - now folio_referenced() 'fails' returning 1
> >
> > But case 2 is only in shrink_active_list() which uses this as a boolean...
> >
> > OK so maybe fine for this one.
>
> For shrink_active_list() it seems like it doesn't matter what
> folio_referenced() returns unless it's an executable file-backed
> folio.

Yes, agreed. I was chatting with David, I think yesterday (my review load atm
makes remembering when I did stuff harder :P), went through the code as a
result, had a closer look, and you're right.

So it returning 1 is fine.

> >
> > I really do also hate that any future callers are going to possibly be confused
> > about how this function works, but I guess it was already 'weird' for
> > file-backed/KSM.
>
> Actually, wouldn't the simplification remove the already existing
> confusion, rather than adding to it? :)
> We can then simply say, rmap_walk() expects folio to be locked.

Yeah, it does simplify in that sense; the real issue is: will we see contention
in some workloads?

I'm sort of gradually softening on this as we talk... but I feel like we
really need to check this more thoroughly.

>
> >
> > So the issue remains really - folio lock contention as a result of this.
> >
> > It's one thing to theorise, but you may be forgetting something... and then
> > we've changed an absolutely core thing to suit a niche UFFD use case.
>
> I really wish there was a way to avoid this within the UFFD code :( In
> fact, the real pain point is multiple UFFD threads contending for
> write-lock on anon_vma, even when they don't need to serialize among
> themselves.
> >
> > I do wonder if we can identify this case and handle things differently.
> >
> > Perhaps even saying 'try and get the rmap lock, but if there's "too much"
> > contention, grab the folio lock'.
>
> Can you please elaborate what you mean? Where do you mean we can
> possibly do something like this?

It's vague hand waving, but generally if we could detect contention then we
could conditionally change how we handle this, perhaps...

>
> UFFD move only works on PageAnonExclusive folios. So, would it help
> (in terms of avoiding contention) if we were to change the condition:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 568198e9efc2..1638e27347e7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1006,7 +1006,8 @@ int folio_referenced(struct folio *folio, int is_locked,
>         if (!folio_raw_mapping(folio))
>                 return 0;
>
> -       if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> +       if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio) ||
> +                          PageAnonExclusive(&folio->page))) {
>                 we_locked = folio_trylock(folio);
>                 if (!we_locked)
>                         return 1;
>
> Obviously, this is opposite of simplification :)
>
> But as we know that shrink_active_list() uses this as a boolean, so do
> we even need to walk rmap for PageAnonExclusive folios? Can't we
> simply do:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 568198e9efc2..a26523de321f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1006,10 +1006,14 @@ int folio_referenced(struct folio *folio, int is_locked,
>         if (!folio_raw_mapping(folio))
>                 return 0;
>
> -       if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> -               we_locked = folio_trylock(folio);
> -               if (!we_locked)
> +       if (!is_locked) {
> +               if (!folio_test_anon(folio) || folio_test_ksm(folio)) {
> +                       we_locked = folio_trylock(folio);
> +                       if (!we_locked)
> +                               return 1;
> +               } else if (PageAnonExclusive(&folio->page)) {
>                         return 1;
> +               }
>         }
>
>         rmap_walk(folio, &rwc);
>
> I'm not at all an expert on this, so pardon my ignorance if this is wrong.

I see David's replied and he _is_ the expert on PAE so will see :)

> >
> > >
> > > Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but
> > > it would have meant that the write fault would be stuck until folio_referenced()
> > > would be done, which is also suboptimal.
> > >
> > > So likely, avoiding grabbing the folio lock was beneficial.
> > >
> > >
> > > Today, this would only affect R/O pages after fork (PageAnonExclusive not set).
> >
> > Hm, that's probably less of a problem.
> >
> > >
> > >
> > > Staring at shrink_active_list()->folio_referenced(), we isolate the folio first
> > > (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio()
> > > would already refuse to reuse immediately, because it would spot a raised reference.
> > > The folio lock does not make a difference anymore.
> >
> > folio_check_references() we're good with anyway as folio already locked.
> >
> > So obviously shrink_active_list() is the only caller we really care about.
> >
> > That at least reduces this case, but we then have to deal with the fact we're
> > contending this lock elsewhere.
> >
> > >
> > >
> > > Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind,
> > > except maybe some folio splitting or khugepaged that maybe would have to wait.
> > >
> > > But khugepaged would already also fail to isolate these folios, so probably it's not that
> > > relevant anymore ...
> >
> > This is it... there's a lot of possibilities and we need to tread extremely
> > carefully.
> >
> > If we could find a way to make uffd deal with this one way or another (or
> > possibly - detecting heavy anon vma lock contention) maybe that'd be
> > better... but then we're adding more complexity, obviously.
> >
> > >
> > > --
> > > Cheers
> > >
> > > David / dhildenb
> > >
> >
> > I mean having said all the above and also in other thread - I am open to being
> > convinced I'm wrong and this is ok.
> >
> > Obviously removing the complicated special case for anon would _in general_ be
> > nice, it's just very sensitive stuff :)
> >
> > Cheers, Lorenzo
