Message-ID: <7ce169c2-09b7-39e3-d00b-ba1db6dd258c@google.com>
Date: Fri, 29 Aug 2025 08:46:52 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Will Deacon <will@...nel.org>
cc: Hugh Dickins <hughd@...gle.com>, David Hildenbrand <david@...hat.com>, 
    linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
    Keir Fraser <keirf@...gle.com>, Jason Gunthorpe <jgg@...pe.ca>, 
    John Hubbard <jhubbard@...dia.com>, Frederick Mayle <fmayle@...gle.com>, 
    Andrew Morton <akpm@...ux-foundation.org>, Peter Xu <peterx@...hat.com>, 
    Rik van Riel <riel@...riel.com>, Vlastimil Babka <vbabka@...e.cz>, 
    Ge Yang <yangge1116@....com>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before
 attempting migration

On Fri, 29 Aug 2025, Will Deacon wrote:
> On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote:
...
> > 
> > It took several days in search of the least bad compromise, but
> > in the end I concluded the opposite of what we'd intended above.
> > 
> > There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
> > ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> > and Ge Yang's 6.11 33dfe9204f29
> > ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
> 
> That's actually pretty good news, as I was initially worried that we'd
> have to backport a fix all the way back to 6.1. From the above, the only
> LTS affected is 6.12.y.

I'm not so sure of that. I think the 6.11 change tickled a particular
sequence that showed up in your testing, but the !folio_test_lru()
check was never an adequate test for whether lru_add_drain_all()
might help.

I can't estimate the probabilities; you'll have to make your own
decision on whether it's worth going back to change a release which
did not (I presume) show problems in real life.

...
> > Unless I'm mistaken, collect_longterm_unpinnable_folios() should
> > never have been relying on folio_test_lru(), and should simply be
> > checking for expected ref_count instead.
> > 
> > Will, please give the portmanteau patch (combination of four)
> > below a try: reversion of 33dfe9204f29 and a later MGLRU fixup,
> > corrected test in collect...(), preparatory lru_add_drain() there.
> > 
> > I hope you won't be proving me wrong again, and I can move on to
> > writing up those four patches (and adding probably three more that
> > make sense in such a series, but should not affect your testing).
> > 
> > I've tested enough to know that it's not harmful, but am hoping
> > to take advantage of your superior testing, particularly in the
> > GUP pin area.  But if you're uneasy with the combination, and would
> > prefer to check just the minimum, then ignore the reversions and try
> > just the mm/gup.c part of it - that will probably be good enough for
> > you even without the reversions.
> 
> Thanks, I'll try to test the whole lot. I was geographically separated
> from my testing device yesterday but I should be able to give it a spin
> later today. I'm _supposed_ to be writing my KVM Forum slides for next
> week, so this offers a perfect opportunity to procrastinate.

Well understood :) And you've already reported on the testing, thanks.

> 
> > Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12
> > (or an intervening release), I already did the backport so please just
> > ask.
> 
> We've got 6.15 working well at the moment, so I'll backport your diff
> to that.
> 
> One question on the diff below:
> 
> > Thanks!
> > 
> >  mm/gup.c    |    5 ++++-
> >  mm/swap.c   |   50 ++++++++++++++++++++++++++------------------------
> >  mm/vmscan.c |    2 +-
> >  3 files changed, 31 insertions(+), 26 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index adffe663594d..9f7c87f504a9 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> >  	struct folio *folio;
> >  	long i = 0;
> >  
> > +	lru_add_drain();
> > +
> >  	for (folio = pofs_get_folio(pofs, i); folio;
> >  	     folio = pofs_next_folio(folio, pofs, &i)) {
> >  
> > @@ -2307,7 +2309,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> >  			continue;
> >  		}
> >  
> > -		if (!folio_test_lru(folio) && drain_allow) {
> > +		if (drain_allow && folio_ref_count(folio) !=
> > +				   folio_expected_ref_count(folio) + 1) {
> >  			lru_add_drain_all();
> 
> How does this synchronise with the folio being added to the mlock batch
> on another CPU?
> 
> need_mlock_drain(), which is what I think lru_add_drain_all() ends up
> using to figure out which CPU batches to process, just looks at the
> 'nr' field in the batch and I can't see anything in mlock_folio() to
> ensure any ordering between adding the folio to the batch and
> incrementing its refcount.
> 
> Then again, my hack to use folio_test_mlocked() would have a similar
> issue because the flag is set (albeit with barrier semantics) before
> adding the folio to the batch, meaning the drain could miss the folio.
> 
> I guess there's some higher-level synchronisation making this all work,
> but it would be good to understand that as I can't see that
> collect_longterm_unpinnable_folios() can rely on much other than the pin.

There is no such strict synchronization: you've been misled if people
have told you that this pinning migration stuff is deterministically
successful. It's best effort - or will others on the Cc disagree?

Just as there's no synchronization between the calculation inside
folio_expected_ref_count() and the reading of folio's refcount.

It wouldn't make sense for this unpinnable collection to anguish over
such synchronization, when a moment later the migration is liable to
fail (on occasion) for other transient reasons.  All of which
apparently ends up reported as -ENOMEM?  That looks unhelpful.

There is a heavy hammer called lru_cache_disable() in mm/swap.c,
stopping the batching and doing its own lru_add_drain_all(): that is
used by CMA and a few others, but not by this pinning (and I do not
want to argue that it should be).

Hugh
