Message-ID: <b65dd020-5e02-4863-8994-def576e3d3dd@redhat.com>
Date: Mon, 16 Jun 2025 22:41:20 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Suren Baghdasaryan <surenb@...gle.com>, Matthew Wilcox
<willy@...radead.org>, Pedro Falcato <pfalcato@...e.de>,
Rik van Riel <riel@...riel.com>, Harry Yoo <harry.yoo@...cle.com>,
Zi Yan <ziy@...dia.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
Dev Jain <dev.jain@....com>, Jakub Matena <matenajakub@...il.com>,
Wei Yang <richard.weiyang@...il.com>, Barry Song <baohua@...nel.org>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via
MREMAP_RELOCATE_ANON
On 16.06.25 22:24, David Hildenbrand wrote:
> Hi Lorenzo,
>
> as discussed offline, there is a lot going on and this is rather ... a
> lot of code+complexity for something that is more of a corner case. :)
>
> Corner-case as in: only select user space will benefit from this, which
> is really a shame.
>
> After your presentation at LSF/MM, I thought about this further, and I
> was wondering whether:
>
> (a) We cannot make this semi-automatic, avoiding flags.
>
> (b) We cannot simplify further by limiting it to the common+easy cases
> first.
>
> I think you already did (b) to some degree as part of this non-RFC, which
> is great.
>
>
> So before digging into the details, let's discuss the high level problem
> briefly.
>
> I think there are three parts to it:
>
> (1) Detecting whether it is safe to adjust the folio->index (small
> folios)
>
> (2) Performance implications of doing so
>
> (3) Detecting whether it is safe to adjust the folio->index (large PTE-
> mapped folios)
>
>
> Regarding (1), if we simply track whether a folio was ever used for
> COW-sharing, it would be very easy: and not only for present folios, but
> for any anon folios that are referenced by swap/migration entries.
> Skimming over patch #1, I think you apply a similar logic, which is good.
>
> Regarding (2), it would apply when we mremap() anon VMAs and they happen
> to reside next to other anon VMAs. Which workloads are we concerned
> about harming by implementing this optimization? I recall that the most
> common use case for mremap() is actually for file mappings, but I might
> be wrong. In any case, we could just have a different way to enable this
> optimization than for each and every mremap() invocation in a process.
>
> Regarding (3), if we were to split large folios that cross VMA
> boundaries during mremap(), it would be simpler.
>
> How is it handled in this series if a large folio crosses VMA
> boundaries? (a) try splitting or (b) fail (not transparent to the user :( ).
>
>
>> This also creates a difference in behaviour, often surprising to users,
>> between mappings which are faulted and those which are not - as for the
>> latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
>>
>> This is problematic firstly because this proliferates kernel allocations
>> that are pure memory pressure - unreclaimable and unmovable -
>> i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
>
>> Secondly, mremap() exhibits an implicit uAPI in that it does not permit
>> remaps which span multiple VMAs (though it does permit remaps that
>> constitute a part of a single VMA).
>
> If I mremap() to create a hole and then mremap() it back, I would expect
> the hole to be closed again automatically, without special flags. Well, we
> both know this is not the case :)
>
>
>> This means that a user must concern themselves with whether merges succeed
>> or not should they wish to use mremap() in such a way which causes multiple
>> mremap() calls to be performed upon mappings.
>
> Right.
>
>>
>> This series provides users with an option to accept the overhead of
>> actually updating the VMA and underlying folios via the
>> MREMAP_RELOCATE_ANON flag.
>
> Okay. I wish we could avoid this flag ...
>
>>
>> If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
>> the mremap() succeeding, then no attempt is made at relocation of folios as
>> this is not required.
>
> Makes sense. This is the existing behavior then.
>
>>
>> Even if no merge is possible upon moving of the region, vma->vm_pgoff and
>> folio->index fields are appropriately updated in order that subsequent
>> mremap() or mprotect() calls will succeed in merging.
>
> By looking at the surrounding VMAs, or simply by always keeping
> folio->index corresponding to the address in the VMA (just as if
> mremap() never happened, I assume?)
>
>>
>> This flag falls back to the ordinary means of mremap() should the operation
>> not be feasible. It also transparently undoes the operation, carefully
>> holding rmap locks such that no racing rmap operation encounters incorrect
>> or missing VMAs.
>
> I absolutely dislike this undo operation, really. :(
>
> I hope we can find a way to just detect early whether this optimization
> would work.
>
> What are the exact error cases that can force an undo?
>
> I assume:
>
> (a) cow-shared anon folio (can detect early)
>
> (b) large folios crossing VMAs (TBD)
>
> (c) KSM folios? Probably we could move them, I *think* we would have to
> update the ksm_rmap_item. Alternatively, we could indicate if a VMA had
> any KSM folios and give up early in the first version.
Looking at patch #1, I can see that we treat KSM folios as "success".
I would have thought we would have to update the corresponding
"ksm_rmap_item" ... somehow, to keep the rmap working.
I know that Wei Yang (already on cc) is working on selftests, which I am
yet to review, but he doesn't cover mremap() yet.
Looking at rmap_walk_ksm(), I am left a bit confused.
We walk all entries in the stable tree (ksm_rmap_item), looking in the
anon_vma interval tree for the entry that corresponds to
ksm_rmap_item->address.
	addr = rmap_item->address & PAGE_MASK;

	if (addr < vma->vm_start || addr >= vma->vm_end)
		continue;
So I would assume that when we mremap() ... we are *already* breaking
KSM rmap walkers? :) Or there is some magic somewhere that I am
missing.
A KSM mremap test case for rmap would be nice ;)
--
Cheers,
David / dhildenb