linux-kernel - Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <56819052-d3f5-4209-824d-5cfbf49ff6e9@redhat.com>
Date: Thu, 28 Aug 2025 22:38:50 +0200
From: David Hildenbrand <david@...hat.com>
To: Hugh Dickins <hughd@...gle.com>
Cc: Will Deacon <will@...nel.org>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Keir Fraser <keirf@...gle.com>,
 Jason Gunthorpe <jgg@...pe.ca>, John Hubbard <jhubbard@...dia.com>,
 Frederick Mayle <fmayle@...gle.com>,
 Andrew Morton <akpm@...ux-foundation.org>, Peter Xu <peterx@...hat.com>,
 Rik van Riel <riel@...riel.com>, Vlastimil Babka <vbabka@...e.cz>,
 Ge Yang <yangge1116@....com>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before
 attempting migration

On 28.08.25 18:12, Hugh Dickins wrote:
> On Thu, 28 Aug 2025, David Hildenbrand wrote:
>> On 28.08.25 10:47, Hugh Dickins wrote:
> ...
>>> It took several days in search of the least bad compromise, but
>>> in the end I concluded the opposite of what we'd intended above.
>>>
>>> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
>>> ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
>>> and Ge Yang's 6.11 33dfe9204f29
>>> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
>>>
>>> It turns out that the mm/swap.c folio batches (apart from lru_add)
>>> are all for best-effort, doesn't matter if it's missed, operations;
>>> whereas mlock and munlock are more serious.  Probably mlock could
>>> be (not very satisfactorily) converted, but then munlock?  Because
>>> of failed folio_test_clear_lru()s, it would be far too likely to
>>> err on either side, munlocking too soon or too late.
>>>
>>> I've concluded that one or the other has to go.  If we're having
>>> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
>>> than 2fbb0c10d1e8 (which is itself far from perfect).  But functionally,
>>> I'm afraid that removing the mlock/munlock batching will show up as a
>>> perceptible regression in realistic workloadsg; and on consideration,
>>> I've found no real justification for the LRU flag clearing change.
>>
>> Just to understand what you are saying: are you saying that we will go back to
>> having a folio being part of multiple LRU caches?
> 
> Yes.  Well, if you count the mlock/munlock batches in as "LRU caches",
> then that has been so all along.

Yes ...

> 
>> :/ If so, I really rally
>> hope that we can find another way and not go back to that old handling.
> 
> For what reason?  It sounded like a nice "invariant" to keep in mind,
> but it's a limitation, and  what purpose was it actually serving?

I liked the semantics that if !lru, there could be at most one reference 
from the LRU caches.

That is, if there are two references, you don't even have to bother with 
flushing anything.

> 
> If it's the "spare room in struct page to keep the address of that one
> batch entry ... efficiently extract ..." that I was dreaming of: that
> has to be a future thing, when perhaps memdescs will allow an extensible
> structure to be attached, and extending it for an mlocked folio (to hold
> the mlock_count instead of squeezing it into lru.prev) would not need
> mlock/munlock batching at all (I guess: far from uppermost in my mind!),
> and including a field for "efficiently extract" from LRU batch would be
> nice.
> 
> But, memdescs or not, there will always be pressure to keep the common
> struct as small as possible, so I don't know if we would actually go
> that way.
> 
> But I suspect that was not your reason at all: please illuminate.

You are very right :)

Regarding the issue at hand:

There were discussions at LSF/MM about also putting (some) large folios 
onto the LRU caches.

In that context, GUP could take multiple references on the same folio, 
and a simple folio_expected_ref_count() + 1 would no longer do the trick.

I thought about this today, and likely it could be handled by scanning 
the page array for same folios etc. A bit nasty when wanting to cover 
all corner cases (folio pages must not be consecutive in the passed 
array ) ...

Apart from that issue, I liked the idea of a "single entry in the cache" 
for other reasons: it gives clear semantics. We cannot end up in a 
scenario where someone performs OPX and later someone OPY on a folio, 
but the way the lru caches are flushed we might end up processing OPX 
after OPY -- this should be able to happen in case of local or remote 
flushes IIRC.

Anyhow, I quickly scanned your code. The folio_expected_ref_count() 
should solve the issue for now. It's quite unfortunate that any raised 
reference will make us now flush all remote LRU caches.

Maybe we just want to limit it to !folio_test_large(), because flushing 
there really doesn't give us any chance in succeeding right now? Not 
sure if it makes any difference in practice, though, we'll likely be 
flushing remote LRU caches now more frequently either way.

-- 
Cheers

David / dhildenb