[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <56819052-d3f5-4209-824d-5cfbf49ff6e9@redhat.com>
Date: Thu, 28 Aug 2025 22:38:50 +0200
From: David Hildenbrand <david@...hat.com>
To: Hugh Dickins <hughd@...gle.com>
Cc: Will Deacon <will@...nel.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Keir Fraser <keirf@...gle.com>,
Jason Gunthorpe <jgg@...pe.ca>, John Hubbard <jhubbard@...dia.com>,
Frederick Mayle <fmayle@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>, Peter Xu <peterx@...hat.com>,
Rik van Riel <riel@...riel.com>, Vlastimil Babka <vbabka@...e.cz>,
Ge Yang <yangge1116@....com>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before
attempting migration
On 28.08.25 18:12, Hugh Dickins wrote:
> On Thu, 28 Aug 2025, David Hildenbrand wrote:
>> On 28.08.25 10:47, Hugh Dickins wrote:
> ...
>>> It took several days in search of the least bad compromise, but
>>> in the end I concluded the opposite of what we'd intended above.
>>>
>>> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
>>> ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
>>> and Ge Yang's 6.11 33dfe9204f29
>>> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
>>>
>>> It turns out that the mm/swap.c folio batches (apart from lru_add)
>>> are all for best-effort, doesn't matter if it's missed, operations;
>>> whereas mlock and munlock are more serious. Probably mlock could
>>> be (not very satisfactorily) converted, but then munlock? Because
>>> of failed folio_test_clear_lru()s, it would be far too likely to
>>> err on either side, munlocking too soon or too late.
>>>
>>> I've concluded that one or the other has to go. If we're having
>>> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
>>> than 2fbb0c10d1e8 (which is itself far from perfect). But functionally,
>>> I'm afraid that removing the mlock/munlock batching will show up as a
>>> perceptible regression in realistic workloadsg; and on consideration,
>>> I've found no real justification for the LRU flag clearing change.
>>
>> Just to understand what you are saying: are you saying that we will go back to
>> having a folio being part of multiple LRU caches?
>
> Yes. Well, if you count the mlock/munlock batches in as "LRU caches",
> then that has been so all along.
Yes ...
>
>> :/ If so, I really rally
>> hope that we can find another way and not go back to that old handling.
>
> For what reason? It sounded like a nice "invariant" to keep in mind,
> but it's a limitation, and what purpose was it actually serving?
I liked the semantics that if !lru, there could be at most one reference
from the LRU caches.
That is, if there are two references, you don't even have to bother with
flushing anything.
>
> If it's the "spare room in struct page to keep the address of that one
> batch entry ... efficiently extract ..." that I was dreaming of: that
> has to be a future thing, when perhaps memdescs will allow an extensible
> structure to be attached, and extending it for an mlocked folio (to hold
> the mlock_count instead of squeezing it into lru.prev) would not need
> mlock/munlock batching at all (I guess: far from uppermost in my mind!),
> and including a field for "efficiently extract" from LRU batch would be
> nice.
>
> But, memdescs or not, there will always be pressure to keep the common
> struct as small as possible, so I don't know if we would actually go
> that way.
>
> But I suspect that was not your reason at all: please illuminate.
You are very right :)
Regarding the issue at hand:
There were discussions at LSF/MM about also putting (some) large folios
onto the LRU caches.
In that context, GUP could take multiple references on the same folio,
and a simple folio_expected_ref_count() + 1 would no longer do the trick.
I thought about this today, and likely it could be handled by scanning
the page array for same folios etc. A bit nasty when wanting to cover
all corner cases (folio pages must not be consecutive in the passed
array ) ...
Apart from that issue, I liked the idea of a "single entry in the cache"
for other reasons: it gives clear semantics. We cannot end up in a
scenario where someone performs OPX and later someone OPY on a folio,
but the way the lru caches are flushed we might end up processing OPX
after OPY -- this should be able to happen in case of local or remote
flushes IIRC.
Anyhow, I quickly scanned your code. The folio_expected_ref_count()
should solve the issue for now. It's quite unfortunate that any raised
reference will make us now flush all remote LRU caches.
Maybe we just want to limit it to !folio_test_large(), because flushing
there really doesn't give us any chance in succeeding right now? Not
sure if it makes any difference in practice, though, we'll likely be
flushing remote LRU caches now more frequently either way.
--
Cheers
David / dhildenb
Powered by blists - more mailing lists