Message-ID: <b65dd020-5e02-4863-8994-def576e3d3dd@redhat.com>
Date: Mon, 16 Jun 2025 22:41:20 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Suren Baghdasaryan <surenb@...gle.com>, Matthew Wilcox
<willy@...radead.org>, Pedro Falcato <pfalcato@...e.de>,
Rik van Riel <riel@...riel.com>, Harry Yoo <harry.yoo@...cle.com>,
Zi Yan <ziy@...dia.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
Dev Jain <dev.jain@....com>, Jakub Matena <matenajakub@...il.com>,
Wei Yang <richard.weiyang@...il.com>, Barry Song <baohua@...nel.org>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via
MREMAP_RELOCATE_ANON
On 16.06.25 22:24, David Hildenbrand wrote:
> Hi Lorenzo,
>
> as discussed offline, there is a lot going on and this is rather ... a
> lot of code+complexity for something that is more of a corner case. :)
>
> Corner-case as in: only select user space will benefit from this, which
> is really a shame.
>
> After your presentation at LSF/MM, I thought about this further, and I
> was wondering whether:
>
> (a) We cannot make this semi-automatic, avoiding flags.
>
> (b) We cannot simplify further by limiting it to the common+easy cases
> first.
>
> I think you already did (b) to some degree as part of this non-RFC, which
> is great.
>
>
> So before digging into the details, let's discuss the high level problem
> briefly.
>
> I think there are three parts to it:
>
> (1) Detecting whether it is safe to adjust the folio->index (small
> folios)
>
> (2) Performance implications of doing so
>
> (3) Detecting whether it is safe to adjust the folio->index (large PTE-
> mapped folios)
>
>
> Regarding (1), if we simply track whether a folio was ever used for
> COW-sharing, it would be very easy: and not only for present folios, but
> for any anon folios that are referenced by swap/migration entries.
> Skimming over patch #1, I think you apply a similar logic, which is good.
>
> Regarding (2), it would apply when we mremap() anon VMAs and they happen
> to reside next to other anon VMAs. Which workloads are we concerned
> about harming by implementing this optimization? I recall that the most
> common use case for mremap() is actually for file mappings, but I might
> be wrong. In any case, we could just have a different way to enable this
> optimization than for each and every mremap() invocation in a process.
>
> Regarding (3), if we were to split large folios that cross VMA
> boundaries during mremap(), it would be simpler.
>
> How is it handled in this series if a large folio crosses VMA
> boundaries? (a) try splitting or (b) fail (not transparent to the user :( ).
>
>
>> This also creates a difference in behaviour, often surprising to users,
>> between mappings which are faulted and those which are not - as for the
>> latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
>>
>> This is problematic firstly because this proliferates kernel allocations
>> that are pure memory pressure - unreclaimable and unmovable -
>> i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
>
>> Secondly, mremap() exhibits an implicit uAPI in that it does not permit
>> remaps which span multiple VMAs (though it does permit remaps that
>> constitute a part of a single VMA).
>
> If I mremap() to create a hole and then mremap() it back, I would expect
> the hole to be closed again automatically, without special flags. Well, we
> both know this is not the case :)
>
>
>> This means that a user must concern themselves with whether merges succeed
>> or not should they wish to use mremap() in such a way which causes multiple
>> mremap() calls to be performed upon mappings.
>
> Right.
>
>>
>> This series provides users with an option to accept the overhead of
>> actually updating the VMA and underlying folios via the
>> MREMAP_RELOCATE_ANON flag.
>
> Okay. I wish we could avoid this flag ...
>
>>
>> If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
>> the mremap() succeeding, then no attempt is made at relocation of folios as
>> this is not required.
>
> Makes sense. This is the existing behavior then.
>
>>
>> Even if no merge is possible upon moving of the region, vma->vm_pgoff and
>> folio->index fields are appropriately updated in order that subsequent
>> mremap() or mprotect() calls will succeed in merging.
>
> By looking at the surrounding VMAs, or simply by always keeping
> folio->index corresponding to the address in the VMA (just as if
> mremap() never happened, I assume?)
>
>>
>> This flag falls back to the ordinary means of mremap() should the operation
>> not be feasible. It also transparently undoes the operation, carefully
>> holding rmap locks such that no racing rmap operation encounters incorrect
>> or missing VMAs.
>
> I absolutely dislike this undo operation, really. :(
>
> I hope we can find a way to just detect early whether this optimization
> would work.
>
> What are the exact error cases that can force an undo?
>
> I assume:
>
> (a) cow-shared anon folio (can detect early)
>
> (b) large folios crossing VMAs (TBD)
>
> (c) KSM folios? Probably we could move them, I *think* we would have to
> update the ksm_rmap_item. Alternatively, we could indicate if a VMA had
> any KSM folios and give up early in the first version.
Looking at patch #1, I can see that we treat KSM folios as "success".
I would have thought we would have to update the corresponding
"ksm_rmap_item" ... somehow, to keep the rmap working.
I know that Wei Yang (already on cc) is working on selftests, which I am
yet to review, but he doesn't cover mremap() yet.
Looking at rmap_walk_ksm(), I am left a bit confused.
We walk all entries in the stable tree (ksm_rmap_item), looking in the
anon_vma interval tree for the entry that corresponds to
ksm_rmap_item->address.
	addr = rmap_item->address & PAGE_MASK;

	if (addr < vma->vm_start || addr >= vma->vm_end)
		continue;
So I would assume that when we mremap() ... we are *already* breaking
KSM rmap walkers? :) Or there is some magic somewhere that I am
missing.
A KSM mremap test case for rmap would be nice ;)
--
Cheers,
David / dhildenb