Open Source and information security mailing list archives
 
Message-ID: <a0350dd8-748b-41d5-899e-1505bd2b2e80@lucifer.local>
Date: Thu, 28 Aug 2025 13:04:36 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Lokesh Gidra <lokeshgidra@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Harry Yoo <harry.yoo@...cle.com>, Zi Yan <ziy@...dia.com>,
        Barry Song <21cnbao@...il.com>,
        "open list:MEMORY MANAGEMENT" <linux-mm@...ck.org>,
        Peter Xu <peterx@...hat.com>, Suren Baghdasaryan <surenb@...gle.com>,
        Kalesh Singh <kaleshsingh@...gle.com>,
        android-mm <android-mm@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Jann Horn <jannh@...gle.com>, Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>
Subject: Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()

On Mon, Aug 25, 2025 at 05:19:05PM +0200, David Hildenbrand wrote:
> On 22.08.25 19:29, Lokesh Gidra wrote:
> > Hi all,
> >
> > Currently, some callers of rmap_walk() conditionally avoid try-locking
> > non-ksm anon folios. This necessitates serialization through anon_vma
> > write-lock elsewhere when folio->mapping and/or folio->index (fields
> > involved in rmap_walk()) are to be updated. This hurts scalability due
> > to coarse granularity of the lock. For instance, when multiple threads
> > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages
> > from the same src VMA, they all contend for the corresponding
> > anon_vma’s lock. Field traces for arm64 android devices reveal over
> > 30ms of uninterruptible sleep in the main UI thread, leading to janky
> > user interactions.
> >
> > Among all rmap_walk() callers that don’t lock anon folios,
> > folio_referenced() is the most critical (others are
> > page_idle_clear_pte_refs(), damon_folio_young(), and
> > damon_folio_mkold()). The relevant code in folio_referenced() is:
> >
> > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> >          we_locked = folio_trylock(folio);
> >          if (!we_locked)
> >                  return 1;
> > }
> >
> > It’s unclear why locking anon_vma exclusively (when updating
> > folio->mapping, like in uffd MOVE) is beneficial over walking rmap
> > with folio locked. It’s in the reclaim path, so should not be a
> > critical path that necessitates some special treatment, unless I’m
> > missing something.
> >
> > Therefore, I propose simplifying the locking mechanism by ensuring the
> > folio is locked before calling rmap_walk().
>
> Essentially, what you mean is roughly:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 34333ae3bd80f..0800e73c0796e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked,
>         if (!folio_raw_mapping(folio))
>                 return 0;
> -       if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> +       if (!is_locked) {
>                 we_locked = folio_trylock(folio);
>                 if (!we_locked)
>                         return 1;
>
>
> The downside of that change is that ordinary (!ksm) folios will observe being locked

Well, anon folios, I guess that's what you meant :)

> when we are actually only trying to assess if they were referenced.
>
> Does it matter?

Also, another downside is that we only trylock and abort if that fails, so we've
now made the reference check for anon folios conditional on winning that trylock.

But surely this is going to impact reclaim performance esp. under heavy memory
pressure? It is at least a trylock.

These are dangerous waters, and I'd really want some detailed data + analysis to
prove the point here; I don't think theorising about it is enough.

>
> I can only speculate that it might have been very relevant before
> 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive").
>
> Essentially any R/O fault would have resulted in us copying the page, simply because
> there is concurrent folio_referenced() happening.

Fun.

Thing is we now have to consider _every case_ where a contention might cause an
issue.

One thing I _was_ concerned about was:

1. uffd MOVE locks the folio
2. now folio_referenced() 'fails', returning 1

But case 2 only matters in shrink_active_list(), which uses the return value as
a boolean...

OK so maybe fine for this one.
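
(For reference, condensed from memory and possibly stale, the
shrink_active_list() usage is just:

	/*
	 * Return value only tested for nonzero, so a contended trylock
	 * reporting 1 simply keeps the folio active rather than causing
	 * anything nastier.
	 */
	if (folio_referenced(folio, 0, sc->target_mem_cgroup, &vm_flags)) {
		if ((vm_flags & VM_EXEC) && folio_is_file_lru(folio)) {
			/* give it another trip around the active list */
			list_add(&folio->lru, &l_active);
			continue;
		}
	}

so the spurious 1 is at least benign there.)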

I really do also hate that any future callers are going to possibly be confused
about how this function works, but I guess it was already 'weird' for
file-backed/KSM.

So the issue remains really - folio lock contention as a result of this.

It's one thing to theorise, but you may be forgetting something... and then
we've changed an absolutely core thing to suit a niche UFFD use case.

I do wonder if we can identify this case and handle things differently.

Perhaps even saying 'try to get the rmap lock, but if there's "too much"
contention, grab the folio lock'.
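
I.e. a hybrid along these lines (pure pseudocode, not a real proposal:
move_pages_locked() is an invented stand-in for the actual uffd MOVE work,
and I'm assuming the anon_vma_trylock_write() shape here):

	if (anon_vma_trylock_write(anon_vma)) {
		/* Uncontended: keep today's behaviour. */
		move_pages_locked();
		anon_vma_unlock_write(anon_vma);
	} else {
		/*
		 * Contended: fall back to the folio lock, so concurrent
		 * MOVEs on the same anon_vma don't all pile up on its
		 * rwsem. Only works if rmap_walk() takes the folio lock
		 * too, which is the change under discussion.
		 */
		folio_lock(folio);
		move_pages_locked();
		folio_unlock(folio);
	}

That keeps the hot reclaim path untouched in the uncontended case, at the
cost of yet more locking complexity.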

>
> Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but
> it would have meant that the write fault would be stuck until folio_referenced()
> would be done, which is also suboptimal.
>
> So likely, avoiding grabbing the folio lock was beneficial.
>
>
> Today, this would only affect R/O pages after fork (PageAnonExclusive not set).

Hm, that's probably less of a problem.

>
>
> Staring at shrink_active_list()->folio_referenced(), we isolate the folio first
> (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio()
> would already refuse to reuse immediately, because it would spot a raised reference.
> The folio lock does not make a difference anymore.

folio_check_references() we're good with anyway, as the folio is already locked
there.

So obviously shrink_active_list() is the only caller we really care about.

That at least reduces this case, but we then have to deal with the fact we're
contending this lock elsewhere.

>
>
> Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind,
> except maybe some folio splitting or khugepaged that maybe would have to wait.
>
> But khugepaged would already also fail to isolate these folios, so probably it's not that
> relevant anymore ...

This is it... there are a lot of possibilities and we need to tread extremely
carefully.

If we could find a way to make uffd deal with this one way or another (or
possibly by detecting heavy anon_vma lock contention), maybe that'd be
better... but then we're adding more complexity, obviously.

>
> --
> Cheers
>
> David / dhildenb
>

I mean, having said all the above (and also in the other thread), I am open to
being convinced I'm wrong and that this is OK.

Obviously, removing the complicated special case for anon would _in general_ be
nice; it's just very sensitive stuff :)

Cheers, Lorenzo
