Message-ID: <8376d8a3-cc36-ae70-0fa8-427e9ca17b9b@google.com>
Date: Thu, 28 Aug 2025 01:47:14 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Will Deacon <will@...nel.org>
cc: Hugh Dickins <hughd@...gle.com>, David Hildenbrand <david@...hat.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Keir Fraser <keirf@...gle.com>, Jason Gunthorpe <jgg@...pe.ca>,
John Hubbard <jhubbard@...dia.com>, Frederick Mayle <fmayle@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>, Peter Xu <peterx@...hat.com>,
Rik van Riel <riel@...riel.com>, Vlastimil Babka <vbabka@...e.cz>,
Ge Yang <yangge1116@....com>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before
attempting migration
On Sun, 24 Aug 2025, Hugh Dickins wrote:
> On Mon, 18 Aug 2025, Will Deacon wrote:
> > On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote:
> > > On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
> > > > I think replace the folio_test_mlocked(folio) part of it by
> > > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
> > > > That should reduce the extra calls to a much more reasonable
> > > > number, while still solving your issue.
> > >
> > > Alas, I fear that the folio may be unevictable by this point (which
> > > seems to coincide with the readahead fault adding it to the LRU above)
> > > but I can try it out.
> >
> > I gave this a spin but I still see failures with this change.
>
> Many thanks, Will, for the precisely relevant traces (in which,
> by the way, mapcount=0 really means _mapcount=0 hence mapcount=1).
>
> Yes, those do indeed illustrate a case which my suggested
> (folio_test_mlocked(folio) && !folio_test_unevictable(folio))
> failed to cover. Very helpful to have an example of that.
>
> And many thanks, David, for your reminder of commit 33dfe9204f29
> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
>
> Yes, I strongly agree with your suggestion that the mlock batch
> be brought into line with its change to the ordinary LRU batches,
> and agree that doing so will be likely to solve Will's issue
> (and similar cases elsewhere, without needing to modify them).
>
> Now I just have to cool my head and get back down into those
> mlock batches. I am fearful that making a change there to suit
> this case will turn out later to break another case (and I just
> won't have time to redevelop as thorough a grasp of the races as
> I had back then). But if we're lucky, applying that "one batch
> at a time" rule will actually make it all more comprehensible.
>
> (I so wish we had spare room in struct page to keep the address
> of that one batch entry, or the CPU to which that one batch
> belongs: then, although that wouldn't eliminate all uses of
> lru_add_drain_all(), it would allow us to efficiently extract
> a target page from its LRU batch without a remote drain.)
>
> I have not yet begun to write such a patch, and I'm not yet sure
> that it's even feasible: this mail sent to get the polite thank
> yous out of my mind, to help clear it for getting down to work.
It took several days in search of the least bad compromise, but
in the end I concluded the opposite of what we'd intended above.
There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
("mm/munlock: mlock_page() munlock_page() batch by pagevec")
and Ge Yang's 6.11 33dfe9204f29
("mm/gup: clear the LRU flag of a page before adding to LRU batch").
It turns out that the mm/swap.c folio batches (apart from lru_add)
are all for best-effort operations, where it does not matter if one
is missed; whereas mlock and munlock are more serious. Probably mlock could
be (not very satisfactorily) converted, but then munlock? Because
of failed folio_test_clear_lru()s, it would be far too likely to
err on either side, munlocking too soon or too late.
I've concluded that one or the other has to go. If we're having
a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
than 2fbb0c10d1e8 (which is itself far from perfect). But functionally,
I'm afraid that removing the mlock/munlock batching will show up as a
perceptible regression in realistic workloads; and on consideration,
I've found no real justification for the LRU flag clearing change.
Unless I'm mistaken, collect_longterm_unpinnable_folios() should
never have been relying on folio_test_lru(), and should simply be
checking for expected ref_count instead.
Will, please give the portmanteau patch (combination of four)
below a try: reversion of 33dfe9204f29 and a later MGLRU fixup,
corrected test in collect...(), preparatory lru_add_drain() there.
I hope you won't be proving me wrong again, and I can move on to
writing up those four patches (and adding probably three more that
make sense in such a series, but should not affect your testing).
I've tested enough to know that it's not harmful, but am hoping
to take advantage of your superior testing, particularly in the
GUP pin area. But if you're uneasy with the combination, and would
prefer to check just the minimum, then ignore the reversions and try
just the mm/gup.c part of it - that will probably be good enough for
you even without the reversions.
Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12
(or an intervening release), I already did the backport so please just
ask.
Thanks!
mm/gup.c | 5 ++++-
mm/swap.c | 50 ++++++++++++++++++++++++++------------------------
mm/vmscan.c | 2 +-
3 files changed, 31 insertions(+), 26 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..9f7c87f504a9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios(
struct folio *folio;
long i = 0;
+ lru_add_drain();
+
for (folio = pofs_get_folio(pofs, i); folio;
folio = pofs_next_folio(folio, pofs, &i)) {
@@ -2307,7 +2309,8 @@ static unsigned long collect_longterm_unpinnable_folios(
continue;
}
- if (!folio_test_lru(folio) && drain_allow) {
+ if (drain_allow && folio_ref_count(folio) !=
+ folio_expected_ref_count(folio) + 1) {
lru_add_drain_all();
drain_allow = false;
}
diff --git a/mm/swap.c b/mm/swap.c
index 3632dd061beb..8ef3dea20e39 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -164,6 +164,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
for (i = 0; i < folio_batch_count(fbatch); i++) {
struct folio *folio = fbatch->folios[i];
+ /* block memcg migration while the folio moves between lru */
+ if (move_fn != lru_add && !folio_test_clear_lru(folio))
+ continue;
+
folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
move_fn(lruvec, folio);
@@ -176,14 +180,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
}
static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
- struct folio *folio, move_fn_t move_fn,
- bool on_lru, bool disable_irq)
+ struct folio *folio, move_fn_t move_fn, bool disable_irq)
{
unsigned long flags;
- if (on_lru && !folio_test_clear_lru(folio))
- return;
-
folio_get(folio);
if (disable_irq)
@@ -191,8 +191,8 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
else
local_lock(&cpu_fbatches.lock);
- if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
- lru_cache_disabled())
+ if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
+ folio_test_large(folio) || lru_cache_disabled())
folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
if (disable_irq)
@@ -201,13 +201,13 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
local_unlock(&cpu_fbatches.lock);
}
-#define folio_batch_add_and_move(folio, op, on_lru) \
- __folio_batch_add_and_move( \
- &cpu_fbatches.op, \
- folio, \
- op, \
- on_lru, \
- offsetof(struct cpu_fbatches, op) >= offsetof(struct cpu_fbatches, lock_irq) \
+#define folio_batch_add_and_move(folio, op) \
+ __folio_batch_add_and_move( \
+ &cpu_fbatches.op, \
+ folio, \
+ op, \
+ offsetof(struct cpu_fbatches, op) >= \
+ offsetof(struct cpu_fbatches, lock_irq) \
)
static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
@@ -231,10 +231,10 @@ static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
void folio_rotate_reclaimable(struct folio *folio)
{
if (folio_test_locked(folio) || folio_test_dirty(folio) ||
- folio_test_unevictable(folio))
+ folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
- folio_batch_add_and_move(folio, lru_move_tail, true);
+ folio_batch_add_and_move(folio, lru_move_tail);
}
void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
@@ -328,10 +328,11 @@ static void folio_activate_drain(int cpu)
void folio_activate(struct folio *folio)
{
- if (folio_test_active(folio) || folio_test_unevictable(folio))
+ if (folio_test_active(folio) || folio_test_unevictable(folio) ||
+ !folio_test_lru(folio))
return;
- folio_batch_add_and_move(folio, lru_activate, true);
+ folio_batch_add_and_move(folio, lru_activate);
}
#else
@@ -507,7 +508,7 @@ void folio_add_lru(struct folio *folio)
lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
folio_set_active(folio);
- folio_batch_add_and_move(folio, lru_add, false);
+ folio_batch_add_and_move(folio, lru_add);
}
EXPORT_SYMBOL(folio_add_lru);
@@ -685,13 +686,13 @@ void lru_add_drain_cpu(int cpu)
void deactivate_file_folio(struct folio *folio)
{
/* Deactivating an unevictable folio will not accelerate reclaim */
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
if (lru_gen_enabled() && lru_gen_clear_refs(folio))
return;
- folio_batch_add_and_move(folio, lru_deactivate_file, true);
+ folio_batch_add_and_move(folio, lru_deactivate_file);
}
/*
@@ -704,13 +705,13 @@ void deactivate_file_folio(struct folio *folio)
*/
void folio_deactivate(struct folio *folio)
{
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
if (lru_gen_enabled() ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
return;
- folio_batch_add_and_move(folio, lru_deactivate, true);
+ folio_batch_add_and_move(folio, lru_deactivate);
}
/**
@@ -723,10 +724,11 @@ void folio_deactivate(struct folio *folio)
void folio_mark_lazyfree(struct folio *folio)
{
if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) ||
+ !folio_test_lru(folio) ||
folio_test_swapcache(folio) || folio_test_unevictable(folio))
return;
- folio_batch_add_and_move(folio, lru_lazyfree, true);
+ folio_batch_add_and_move(folio, lru_lazyfree);
}
void lru_add_drain(void)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a48aec8bfd92..674999999cd0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4507,7 +4507,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
}
/* ineligible */
- if (!folio_test_lru(folio) || zone > sc->reclaim_idx) {
+ if (zone > sc->reclaim_idx) {
gen = folio_inc_gen(lruvec, folio, false);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;