Message-ID: <92b16e68-0083-48aa-b281-5c235fd91bfe@redhat.com>
Date: Sat, 16 Aug 2025 10:15:59 +0200
From: David Hildenbrand <david@...hat.com>
To: Hugh Dickins <hughd@...gle.com>, Will Deacon <will@...nel.org>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Keir Fraser <keirf@...gle.com>, Jason Gunthorpe <jgg@...pe.ca>,
John Hubbard <jhubbard@...dia.com>, Frederick Mayle <fmayle@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>, Peter Xu <peterx@...hat.com>,
Rik van Riel <riel@...riel.com>, Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before
attempting migration
On 16.08.25 06:14, Hugh Dickins wrote:
> On Fri, 15 Aug 2025, Will Deacon wrote:
>
>> When taking a longterm GUP pin via pin_user_pages(),
>> __gup_longterm_locked() tries to migrate target folios that should not
>> be longterm pinned, for example because they reside in a CMA region or
>> movable zone. This is done by first pinning all of the target folios
>> anyway, collecting all of the longterm-unpinnable target folios into a
>> list, dropping the pins that were just taken and finally handing the
>> list off to migrate_pages() for the actual migration.
>>
>> It is critically important that no unexpected references are held on the
>> folios being migrated, otherwise the migration will fail and
>> pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
>> relatively easy to observe migration failures when running pKVM (which
>> uses pin_user_pages() on crosvm's virtual address space to resolve
>> stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
>> this results in the VM terminating prematurely.
>>
>> In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
>> mapping of guest memory prior to the pinning. Subsequently, when
>> pin_user_pages() walks the page-table, the relevant 'pte' is not
>> present and so the faulting logic allocates a new folio, mlocks it
>> with mlock_folio() and maps it in the page-table.
>>
>> Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
>> batch by pagevec"), mlock/munlock operations on a folio (formerly page)
>> are deferred. For example, mlock_folio() takes an additional reference
>> on the target folio before placing it into a per-cpu 'folio_batch' for
>> later processing by mlock_folio_batch(), which drops the refcount once
>> the operation is complete. Processing of the batches is coupled with
>> the LRU batch logic and can be forcefully drained with
>> lru_add_drain_all() but as long as a folio remains unprocessed on the
>> batch, its refcount will be elevated.
>>
>> This deferred batching therefore interacts poorly with the pKVM pinning
>> scenario as we can find ourselves in a situation where the migration
>> code fails to migrate a folio due to the elevated refcount from the
>> pending mlock operation.
>
> Thanks for the very full description, Will, that helped me a lot
> (I know very little of the GUP pinning end).
>
> But one thing would help me to understand better: are the areas being
> pinned anonymous or shmem or file memory (or COWed shmem or file)?
>
> From "the faulting logic allocates a new folio" I first assumed
> anonymous, but later came to think "mlocks it with mlock_folio()"
> implies they are shmem or file folios (which, yes, can also be
> allocated by fault).
>
> IIRC anonymous and COW faults would go the mlock_new_folio() way,
> where the folio goes on to the mlock folio batch without having yet
> reached LRU: those should be dealt with by the existing
> !folio_test_lru() check.
Very right. That would also explain why I don't remember easily running
into this in wp_can_reuse_anon_folio(), where we also drain the LRU if we
spot a !lru folio.
Having to enlighten all such code to check for
	if (!folio_test_lru(folio) || folio_test_mlocked(folio))
is a bit suboptimal, but we could add a helper for that. (We might end
up draining for folios that are actually mlocked, though :( )
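For illustration only, such a helper might look like the sketch below;
the name is made up (not an existing kernel API) and it merely wraps the
two checks from above:

	/*
	 * Hypothetical helper: true if the folio may still hold a pending
	 * reference on a per-CPU LRU batch or on the mlock batch.
	 */
	static inline bool folio_maybe_on_deferred_batch(struct folio *folio)
	{
		/* Not (or no longer) marked LRU: may sit on an LRU batch. */
		if (!folio_test_lru(folio))
			return true;
		/* Mlocked folios may have a pending mlock-batch reference. */
		return folio_test_mlocked(folio);
	}

Callers that today only drain on !folio_test_lru() would then drain
whenever this returns true, which is exactly the downside above: we would
also drain for folios that are simply mlocked.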
We recently changed things with 33dfe9204f29 ("mm/gup: clear the LRU flag
of a page before adding to LRU batch") so that a folio can only be on a
single LRU batch at a time.
So if we spot !folio_test_lru(folio), there can be at most one reference
from LRU batches.
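That is also roughly what the existing drain-once logic in the GUP
migration path relies on (paraphrased from memory, not the literal
mm/gup.c code; drain_allow is the local one-shot flag there):

	/*
	 * !LRU means at most one LRU batch can hold a reference to this
	 * folio, so a single global drain is enough to flush it.
	 */
	if (!folio_test_lru(folio) && drain_allow) {
		lru_add_drain_all();
		drain_allow = false;
	}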
So naturally I wonder whether we could somehow arrange for a folio to be
either on an ordinary LRU batch or on the mlock batch, but never on both,
so that checking the folio's LRU flag alone would be sufficient.
That would mean clearing the LRU flag when adding a folio to the mlock
batch. But I'm not sure how feasible that is (e.g., if the folio is
already isolated).
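Purely as a sketch of that direction (hand-waving over the per-CPU batch
locking, with fbatch standing in for the per-CPU mlock batch, and the
already-isolated case being exactly the open question), the
add-to-mlock-batch side could try something like:

	/*
	 * Sketch only: take the folio off the LRU when queueing it for
	 * deferred mlock processing, so that !folio_test_lru() alone
	 * signals "a per-CPU batch may hold a reference".
	 */
	if (folio_test_clear_lru(folio)) {
		folio_get(folio);
		folio_batch_add(fbatch, folio);
	} else {
		/* Folio already isolated (or on another batch): unclear. */
	}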
--
Cheers
David / dhildenb