Message-ID: <aKMrOHYbTtDhOP6O@willie-the-truck>
Date: Mon, 18 Aug 2025 14:31:36 +0100
From: Will Deacon <will@...nel.org>
To: Hugh Dickins <hughd@...gle.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	Keir Fraser <keirf@...gle.com>, Jason Gunthorpe <jgg@...pe.ca>,
	David Hildenbrand <david@...hat.com>,
	John Hubbard <jhubbard@...dia.com>,
	Frederick Mayle <fmayle@...gle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Xu <peterx@...hat.com>, Rik van Riel <riel@...riel.com>,
	Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before
 attempting migration

Hi Hugh,

On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
> On Fri, 15 Aug 2025, Will Deacon wrote:
> 
> > When taking a longterm GUP pin via pin_user_pages(),
> > __gup_longterm_locked() tries to migrate target folios that should not
> > be longterm pinned, for example because they reside in a CMA region or
> > movable zone. This is done by first pinning all of the target folios
> > anyway, collecting all of the longterm-unpinnable target folios into a
> > list, dropping the pins that were just taken and finally handing the
> > list off to migrate_pages() for the actual migration.
> > 
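To spell out the shape of that collection step, a simplified sketch
(folio_is_longterm_pinnable() and folio_isolate_lru() are the real mm
helpers; the rest is paraphrase with error handling elided):

	/* Simplified sketch of the collection step, not the exact mm/gup.c code. */
	static void collect_unpinnable_sketch(struct folio **folios, long nr,
					      struct list_head *movable_folio_list)
	{
		long i;

		for (i = 0; i < nr; i++) {
			struct folio *folio = folios[i];

			/* Skip folios that may legitimately stay pinned... */
			if (folio_is_longterm_pinnable(folio))
				continue;

			/* ...and queue the rest (CMA, ZONE_MOVABLE) for migration. */
			if (folio_isolate_lru(folio))
				list_add_tail(&folio->lru, movable_folio_list);
		}

		/* The caller then drops its pins and hands the list to migrate_pages(). */
	}
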
> > It is critically important that no unexpected references are held on the
> > folios being migrated, otherwise the migration will fail and
> > pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
> > relatively easy to observe migration failures when running pKVM (which
> > uses pin_user_pages() on crosvm's virtual address space to resolve
> > stage-2 page faults from the guest) on a 6.15-based Pixel 6 device, and
> > this results in the VM terminating prematurely.
> > 
> > In the failure case, 'crosvm' has called mlock2(MLOCK_ONFAULT) on its
> > mapping of guest memory prior to the pinning. Subsequently, when
> > pin_user_pages() walks the page-table, the relevant 'pte' is not
> > present and so the faulting logic allocates a new folio, mlocks it
> > with mlock_folio() and maps it in the page-table.
> > 
> > Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
> > batch by pagevec"), mlock/munlock operations on a folio (formerly page)
> > are deferred. For example, mlock_folio() takes an additional reference
> > on the target folio before placing it into a per-cpu 'folio_batch' for
> > later processing by mlock_folio_batch(), which drops the refcount once
> > the operation is complete. Processing of the batches is coupled with
> > the LRU batch logic and can be forcefully drained with
> > lru_add_drain_all(), but as long as a folio remains unprocessed on the
> > batch, its refcount will be elevated.
> > 
> > This deferred batching therefore interacts poorly with the pKVM pinning
> > scenario as we can find ourselves in a situation where the migration
> > code fails to migrate a folio due to the elevated refcount from the
> > pending mlock operation.
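For reference, the deferred path is shaped roughly like this (a
paraphrase of mm/mlock.c with the locking and LRU-flag encoding elided;
this_cpu_mlock_batch() here stands in for the real per-CPU lookup):

	/* Paraphrase only: the extra reference lives for as long as the
	 * folio sits in the per-CPU batch. */
	void mlock_folio(struct folio *folio)
	{
		struct folio_batch *fbatch = this_cpu_mlock_batch();

		folio_get(folio);			/* the elevated reference */
		if (!folio_batch_add(fbatch, folio))
			mlock_folio_batch(fbatch);	/* process batch, drop refs */
	}
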
> 
> Thanks for the very full description, Will, that helped me a lot
> (I know very little of the GUP pinning end).
> 
> But one thing would help me to understand better: are the areas being
> pinned anonymous or shmem or file memory (or COWed shmem or file)?

crosvm is using memfd_create(), so I think that means it's anonymous and
shmem? I'm not entirely sure what the right terminology is for a memfd.
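
FWIW, the mapping is set up along these lines (an illustrative skeleton,
not crosvm's actual code):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <unistd.h>

	static void *guest_mem_setup(size_t size)
	{
		int fd = memfd_create("guest-mem", 0);
		void *mem;

		if (fd < 0 || ftruncate(fd, size) < 0)
			return NULL;

		mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (mem == MAP_FAILED)
			return NULL;

		/* Pages are mlocked as they are faulted in, per the traces below. */
		if (mlock2(mem, size, MLOCK_ONFAULT))
			return NULL;

		return mem;
	}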

> From "the faulting logic allocates a new folio" I first assumed
> anonymous, but later came to think "mlocks it with mlock_folio()"
> implies they are shmem or file folios (which, yes, can also be
> allocated by fault).
> 
> IIRC anonymous and COW faults would go the mlock_new_folio() way,
> where the folio goes on to the mlock folio batch without having yet
> reached LRU: those should be dealt with by the existing
> !folio_test_lru() check.

Most of my analysis has been constructed from backtraces when we've
managed to catch the problem. Thankfully, I've saved most of them, so
I went back to have a look and you're absolutely right -- it's _not_
the pin_user_pages() which allocates the page in this case, so apologies
for getting that wrong in the commit message. I've clearly been reading
too many trace logs!

The page is allocated and mlocked just before the pin thanks to a user
page fault:

 crosvm_VmRunApp-6493    [007] ...2.   144.767288: page_ref_mod: pfn=0x9f83a8 flags=locked|uptodate|swapbacked|mlocked count=4 mapcount=0 mapping=00000000f0e9c5fd mt=4 val=1
 crosvm_VmRunApp-6493    [007] ...2.   144.767288: <stack trace>
 => trace_event_raw_event_page_ref_mod_template
 => __page_ref_mod
 => mlock_folio
 => folio_add_file_rmap_ptes
 => set_pte_range
 => finish_fault
 => do_pte_missing
 => handle_mm_fault
 => do_page_fault
 => do_translation_fault
 => do_mem_abort
 => el0_da
 => el0t_64_sync_handler
 => el0t_64_sync

and the pin_user_pages() comes in a little later (on a different CPU):

    crosvm_vcpu0-6499    [003] ...1.   144.786828: page_ref_mod: pfn=0x9f83a8 flags=uptodate|dirty|lru|swapbacked|unevictable|mlocked count=1027 mapcount=0 mapping=00000000f0e9c5fd mt=4 val=1024
    crosvm_vcpu0-6499    [003] ...1.   144.786832: <stack trace>
 => trace_event_raw_event_page_ref_mod_template
 => __page_ref_mod
 => try_grab_folio
 => follow_page_pte
 => __get_user_pages
 => __gup_longterm_locked
 => pin_user_pages
 => __pkvm_pin_user_pages
 => pkvm_mem_abort
 => pkvm_mem_abort_prefault
 => kvm_handle_guest_abort
 => handle_exit
 => kvm_arch_vcpu_ioctl_run
 => kvm_vcpu_ioctl
 => __arm64_sys_ioctl
 => invoke_syscall
 => el0_svc_common
 => do_el0_svc
 => el0_svc
 => el0t_64_sync_handler
 => el0t_64_sync

Between the two there's an interesting filemap fault triggering
readahead, and that's what adds the folio to the LRU:

 crosvm_VmRunApp-6493    [007] ...1.   144.767358: page_ref_mod_and_test: pfn=0x9f83a8 flags=uptodate|dirty|lru|swapbacked|unevictable|mlocked count=3 mapcount=0 mapping=00000000f0e9c5fd mt=4 val=-1 ret=0
 crosvm_VmRunApp-6493    [007] ...1.   144.767359: <stack trace>
 => trace_event_raw_event_page_ref_mod_and_test_template
 => __page_ref_mod_and_test
 => folios_put_refs
 => folio_batch_move_lru
 => __folio_batch_add_and_move
 => folio_add_lru
 => filemap_add_folio
 => page_cache_ra_unbounded
 => page_cache_ra_order
 => do_sync_mmap_readahead
 => filemap_fault
 => __do_fault
 => do_pte_missing
 => handle_mm_fault
 => do_page_fault
 => do_translation_fault
 => do_mem_abort
 => el0_ia
 => el0t_64_sync_handler
 => el0t_64_sync

> > diff --git a/mm/gup.c b/mm/gup.c
> > index adffe663594d..656835890f05 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> >  			continue;
> >  		}
> >  
> > -		if (!folio_test_lru(folio) && drain_allow) {
> > +		if (drain_allow &&
> > +		   (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
> >  			lru_add_drain_all();
> >  			drain_allow = false;
> >  		}
> 
> Hmm.  That is going to call lru_add_drain_all() whenever any of the
> pages in the list is mlocked, and lru_add_drain_all() is a function
> we much prefer to avoid calling (it's much better than the old days
> when it could involve every CPU IPIing every other CPU at the same
> time; but it's still raising doubts to this day, and best avoided).
>
> (Not as bad as I first thought: those unpinnably-placed mlocked
> folios will get migrated, not appearing again in repeat runs.)
> 
> I think replace the folio_test_mlocked(folio) part of it by
> (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
> That should reduce the extra calls to a much more reasonable
> number, while still solving your issue.

Alas, I fear that the folio may be unevictable by this point (which
seems to coincide with the readahead fault adding it to the LRU above)
but I can try it out.
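
For concreteness, I read the suggested check (against the hunk above) as
something like:

	if (drain_allow && (!folio_test_lru(folio) ||
	    (folio_test_mlocked(folio) &&
	     !folio_test_unevictable(folio)))) {
		lru_add_drain_all();
		drain_allow = false;
	}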

> But in addition, please add an unconditional lru_add_drain()
> (the local CPU one, not the inter-CPU _all) at the head of
> collect_longterm_unpinnable_folios().  My guess is that that
> would eliminate 90% of the calls to the lru_add_drain_all() below:
> not quite enough to satisfy you, but enough to be a good improvement.

Sure, that's straightforward enough. We do already see an
mlock_drain_local() from try_to_migrate_one(), but that happens after all
the unpinnable folios have been collected, so it doesn't help us here.
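i.e. something like this (sketch; signature and loop body elided):

	static unsigned long collect_longterm_unpinnable_folios(/* ... */)
	{
		/*
		 * Flush this CPU's pending LRU/mlock batches up front:
		 * the cheap local drain, not lru_add_drain_all().
		 */
		lru_add_drain();

		/* ... existing collection loop ... */
	}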

Will
