linux-kernel - Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration

Message-ID: <aLGVsXpyUx9-ZRIl@willie-the-truck>
Date: Fri, 29 Aug 2025 12:57:37 +0100
From: Will Deacon <will@...nel.org>
To: Hugh Dickins <hughd@...gle.com>
Cc: David Hildenbrand <david@...hat.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, Keir Fraser <keirf@...gle.com>,
	Jason Gunthorpe <jgg@...pe.ca>, John Hubbard <jhubbard@...dia.com>,
	Frederick Mayle <fmayle@...gle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Xu <peterx@...hat.com>, Rik van Riel <riel@...riel.com>,
	Vlastimil Babka <vbabka@...e.cz>, Ge Yang <yangge1116@....com>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before
 attempting migration

Hi Hugh,

On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote:
> On Sun, 24 Aug 2025, Hugh Dickins wrote:
> > On Mon, 18 Aug 2025, Will Deacon wrote:
> > > On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote:
> > > > On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
> > > > > I think replace the folio_test_mlocked(folio) part of it by
> > > > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
> > > > > That should reduce the extra calls to a much more reasonable
> > > > > number, while still solving your issue.
> > > > 
> > > > Alas, I fear that the folio may be unevictable by this point (which
> > > > seems to coincide with the readahead fault adding it to the LRU above)
> > > > but I can try it out.
> > > 
> > > I gave this a spin but I still see failures with this change.
> > 
> > Many thanks, Will, for the precisely relevant traces (in which,
> > by the way, mapcount=0 really means _mapcount=0 hence mapcount=1).
> > 
> > Yes, those do indeed illustrate a case which my suggested
> > (folio_test_mlocked(folio) && !folio_test_unevictable(folio))
> > failed to cover.  Very helpful to have an example of that.
> > 
> > And many thanks, David, for your reminder of commit 33dfe9204f29
> > ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
> > 
> > Yes, I strongly agree with your suggestion that the mlock batch
> > be brought into line with its change to the ordinary LRU batches,
> > and agree that doing so will be likely to solve Will's issue
> > (and similar cases elsewhere, without needing to modify them).
> > 
> > Now I just have to cool my head and get back down into those
> > mlock batches.  I am fearful that making a change there to suit
> > this case will turn out later to break another case (and I just
> > won't have time to redevelop as thorough a grasp of the races as
> > I had back then).  But if we're lucky, applying that "one batch
> > at a time" rule will actually make it all more comprehensible.
> > 
> > (I so wish we had spare room in struct page to keep the address
> > of that one batch entry, or the CPU to which that one batch
> > belongs: then, although that wouldn't eliminate all uses of
> > lru_add_drain_all(), it would allow us to efficiently extract
> > a target page from its LRU batch without a remote drain.)
> > 
> > I have not yet begun to write such a patch, and I'm not yet sure
> > that it's even feasible: this mail sent to get the polite thank
> > yous out of my mind, to help clear it for getting down to work.
> 
> It took several days in search of the least bad compromise, but
> in the end I concluded the opposite of what we'd intended above.
> 
> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
> ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> and Ge Yang's 6.11 33dfe9204f29
> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").

That's actually pretty good news, as I was initially worried that we'd
have to backport a fix all the way back to 6.1. From the above, the only
LTS affected is 6.12.y.

> It turns out that the mm/swap.c folio batches (apart from lru_add)
> are all for best-effort, doesn't matter if it's missed, operations;
> whereas mlock and munlock are more serious.  Probably mlock could
> be (not very satisfactorily) converted, but then munlock?  Because
> of failed folio_test_clear_lru()s, it would be far too likely to
> err on either side, munlocking too soon or too late.
> 
> I've concluded that one or the other has to go.  If we're having
> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
> than 2fbb0c10d1e8 (which is itself far from perfect).  But functionally,
> I'm afraid that removing the mlock/munlock batching will show up as a
> perceptible regression in realistic workloads; and on consideration,
> I've found no real justification for the LRU flag clearing change.
> 
> Unless I'm mistaken, collect_longterm_unpinnable_folios() should
> never have been relying on folio_test_lru(), and should simply be
> checking for expected ref_count instead.
> 
> Will, please give the portmanteau patch (combination of four)
> below a try: reversion of 33dfe9204f29 and a later MGLRU fixup,
> corrected test in collect...(), preparatory lru_add_drain() there.
> 
> I hope you won't be proving me wrong again, and I can move on to
> writing up those four patches (and adding probably three more that
> make sense in such a series, but should not affect your testing).
> 
> I've tested enough to know that it's not harmful, but am hoping
> to take advantage of your superior testing, particularly in the
> GUP pin area.  But if you're uneasy with the combination, and would
> prefer to check just the minimum, then ignore the reversions and try
> just the mm/gup.c part of it - that will probably be good enough for
> you even without the reversions.

Thanks, I'll try to test the whole lot. I was geographically separated
from my testing device yesterday but I should be able to give it a spin
later today. I'm _supposed_ to be writing my KVM Forum slides for next
week, so this offers a perfect opportunity to procrastinate.

> Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12
> (or an intervening release), I already did the backport so please just
> ask.

We've got 6.15 working well at the moment, so I'll backport your diff
to that.

One question on the diff below:

> Thanks!
> 
>  mm/gup.c    |    5 ++++-
>  mm/swap.c   |   50 ++++++++++++++++++++++++++------------------------
>  mm/vmscan.c |    2 +-
>  3 files changed, 31 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index adffe663594d..9f7c87f504a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios(
>  	struct folio *folio;
>  	long i = 0;
>  
> +	lru_add_drain();
> +
>  	for (folio = pofs_get_folio(pofs, i); folio;
>  	     folio = pofs_next_folio(folio, pofs, &i)) {
>  
> @@ -2307,7 +2309,8 @@ static unsigned long collect_longterm_unpinnable_folios(
>  			continue;
>  		}
>  
> -		if (!folio_test_lru(folio) && drain_allow) {
> +		if (drain_allow && folio_ref_count(folio) !=
> +				   folio_expected_ref_count(folio) + 1) {
>  			lru_add_drain_all();

How does this synchronise with the folio being added to the mlock batch
on another CPU?

need_mlock_drain(), which is what I think lru_add_drain_all() ends up
using to figure out which CPU batches to process, just looks at the
'nr' field in the batch and I can't see anything in mlock_folio() to
ensure any ordering between adding the folio to the batch and
incrementing its refcount.

Then again, my hack to use folio_test_mlocked() would have a similar
issue because the flag is set (albeit with barrier semantics) before
adding the folio to the batch, meaning the drain could miss the folio.

I guess there's some higher-level synchronisation making this all work,
but it would be good to understand that as I can't see that
collect_longterm_unpinnable_folios() can rely on much other than the pin.

Will