linux-kernel - Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE

Message-ID: <c9075468-f8c8-4114-85df-c8b6afc6d8b4@lucifer.local>
Date: Tue, 17 Jun 2025 11:50:26 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Matthew Wilcox <willy@...radead.org>, Pedro Falcato <pfalcato@...e.de>,
        Rik van Riel <riel@...riel.com>, Harry Yoo <harry.yoo@...cle.com>,
        Zi Yan <ziy@...dia.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
        Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
        Dev Jain <dev.jain@....com>, Jakub Matena <matenajakub@...il.com>,
        Wei Yang <richard.weiyang@...il.com>, Barry Song <baohua@...nel.org>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via
 MREMAP_RELOCATE_ANON

On Mon, Jun 16, 2025 at 10:24:05PM +0200, David Hildenbrand wrote:
> Hi Lorenzo,
>
> as discussed offline, there is a lot going on and this is rather ... a lot of
> code+complexity for something that is more of a corner case. :)
>
> Corner-case as in: only select user space will benefit from this, which is
> really a shame.

Right, but this is why there's a flag for it. If you don't want to use it, you
don't have to.

I mean one can argue many things in the kernel are attacking corner cases, I
don't think that's an argument against a feature. mremap() _itself_ is a corner
case :)

On a longer-term aside: I'd like to address this in a far more broad fashion, in
fact I literally am now co-maintaining rmap with you largely because I want to
do this :P

So believe me, this is something that will be at least _tried_. But in the
meantime, the idea is we provide a means to work around a very major limitation
of anon remap.

>
> After your presentation at LSF/MM, I thought about this further, and I was
> wondering whether:
>
> (a) We cannot make this semi-automatic, avoiding flags.

I've addressed the suggestions from LSF/MM in the cover letter below.

I don't think this is possible, largely because of the issues around how we
figure out the anon_vma to attach to.

>
> (b) We cannot simplify further by limiting it to the common+easy cases
> first.
>
> I think you already to some degree did b) as part of this non-RFC, which is
> great.

Main simplifications are - we never touch anything CoW'd, we only allow 'true'
anon.

Well we focus on 'true' anon first (i.e. no MAP_PRIVATE) so we simplify that
way. Otherwise it's pretty complete.

>
>
> So before digging into the details, let's discuss the high level problem
> briefly.
>
> I think there are three parts to it:
>
> (1) Detecting whether it is safe to adjust the folio->index (small
>     folios)
>
> (2) Performance implications of doing so
>
> (3) Detecting whether it is safe to adjust the folio->index (large PTE-
>     mapped  folios)

I think you're forgetting folio->mapping also.

This is where a lot of the complexity is - it's rather chicken-and-egg. You need
to:

a. Know that you cannot currently merge with another anon VMA (and thus avoid
   having to do any of this).
b. Have a new VMA with an anon_vma to which you can relocate the folio.
c. Have that anon_vma locked...

>
>
> Regarding (1), if we simply track whether a folio was ever used for
> COW-sharing, it would be very easy: and not only for present folios, but for
> any anon folios that are referenced by swap/migration entries. Skimming over
> patch #1, I think you apply a similar logic, which is good.

Right.

>
> Regarding (2), it would apply when we mremap() anon VMAs and they happen to
> reside next to other anon VMAs. Which workloads are we concerned about
> harming by implementing this optimization? I recall that the most common use
> case for mremap() is actually for file mappings, but I might be wrong. In
> any case, we could just have a different way to enable this optimization
> than for each and every mremap() invocation in a process.

Yeah we're getting into prctl, mctl hellscape here if we go down that road. And
I want to be conservative here. Having it as an mremap() flag doesn't prevent us
from later doing something policy-ish.

>
> Regarding (3), if we were to split large folios that cross VMA boundaries
> during mremap(), it would be simpler.

The code does that.

>
> How is it handled in this series if a large folio crosses VMA boundaries?
> (a) try splitting or (b) fail (not transparent to the user :( ).

a.

This was a painful thing to work on...

>
>
> > This also creates a difference in behaviour, often surprising to users,
> > between mappings which are faulted and those which are not - as for the
> > latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
> >
> > This is problematic firstly because this proliferates kernel allocations
> > that are pure memory pressure - unreclaimable and unmovable -
> > i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
> >
> > Secondly, mremap() exhibits an implicit uAPI in that it does not permit
> > remaps which span multiple VMAs (though it does permit remaps that
> > constitute a part of a single VMA).
>
> If I mremap() to create a hole and mremap() it back, I would assume to
> automatically get the hole closed again, without special flags. Well, we
> both know this is not the case :)

This is a profoundly confusing thing for users, sadly.
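
As an illustrative aside (not code from the series), the faulted versus unfaulted difference can be observed from userspace with existing interfaces only. The sketch below lays out two anonymous mappings inside one reservation, optionally writes to them, moves the second so it sits directly after the first, and counts the resulting entries in /proc/self/maps. The helper names count_vmas() and vmas_after_move() are made up for this example. On kernels without the series one would expect 1 for the unfaulted case and 2 for the faulted case, but that is an expectation to verify on the kernel at hand rather than a guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the /proc/self/maps entries that overlap [start, end). */
static int count_vmas(unsigned long start, unsigned long end)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[1024];
	unsigned long s, e;
	int n = 0;

	assert(end > start);
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "%lx-%lx", &s, &e) == 2 && s < end && e > start)
			n++;
	fclose(f);
	return n;
}

/*
 * Lay out [a][slot][gap][b] inside one PROT_NONE reservation, optionally
 * fault a and b in, then move b into slot so that it directly follows a.
 * Returns the number of VMAs covering a plus the moved b, or -1 on error.
 */
static int vmas_after_move(int fault_in)
{
	size_t len = 4 * (size_t)sysconf(_SC_PAGESIZE);
	int rw = PROT_READ | PROT_WRITE;
	int fl = MAP_PRIVATE | MAP_ANONYMOUS;
	char *base = mmap(NULL, 4 * len, PROT_NONE, fl, -1, 0);
	char *a, *b, *moved;
	int n = -1;

	if (base == MAP_FAILED)
		return -1;
	a = mmap(base, len, rw, fl | MAP_FIXED, -1, 0);
	b = mmap(base + 3 * len, len, rw, fl | MAP_FIXED, -1, 0);
	if (a == MAP_FAILED || b == MAP_FAILED)
		goto out;
	if (fault_in) {
		/* Writing populates the mappings, so they are no longer "unfaulted". */
		memset(a, 1, len);
		memset(b, 2, len);
	}
	/* MREMAP_FIXED replaces the PROT_NONE slot with the moved mapping. */
	moved = mremap(b, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, base + len);
	if (moved != MAP_FAILED)
		n = count_vmas((unsigned long)base, (unsigned long)base + 2 * len);
out:
	munmap(base, 4 * len);
	return n;
}
```

Calling vmas_after_move(0) and vmas_after_move(1) from a small main() and printing both results is enough to see whether the moved mapping merged with its new neighbour.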

>
> >
> > This means that a user must concern themselves with whether merges succeed
> > or not should they wish to use mremap() in such a way which causes multiple
> > mremap() calls to be performed upon mappings.
>
> Right.
>
> >
> > This series provides users with an option to accept the overhead of
> > actually updating the VMA and underlying folios via the
> > MREMAP_RELOCATE_ANON flag.
>
> Okay. I wish we could avoid this flag ...

Me too... hey I've run kernels with this flag just turned on by default and they
seemed fine ;)

>
> >
> > If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
> > the mremap() succeeding, then no attempt is made at relocation of folios as
> > this is not required.
>
> Makes sense. This is the existing behavior then.

Yes, so we have a sane fallback.

>
> >
> > Even if no merge is possible upon moving of the region, vma->vm_pgoff and
> > folio->index fields are appropriately updated in order that subsequent
> > mremap() or mprotect() calls will succeed in merging.
>
> By looking at the surrounding VMAs or simply by trying to always keep the
> folio->index to correspond to the address in the VMA? (just if mremap()
> never happened, I assume?)

This actually addresses future mprotect merges, for instance (e.g. immediately
adjacent non-compatible VMA gets mprotect()'d to something compatible), or if
other VMAs are mapped adjacent to the moved VMA etc.

It just means, if you set this flag, and the operation succeeds, we will still
change vma->vm_pgoff and folio->index such that the VMA is mergeable with
immediately adjacent, compatible VMAs.
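
To spell out the offset arithmetic behind this, here is a small userspace model (a sketch, not kernel code). The struct and every model_* name are invented for illustration and MODEL_PAGE_SHIFT assumes 4 KiB pages. It only captures the vm_pgoff side of the story: real merging also requires compatible flags, anon_vma and so on, which the model ignores.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Simplified stand-in for the vm_area_struct fields that matter here. */
struct model_vma {
	unsigned long vm_start; /* inclusive, page aligned */
	unsigned long vm_end;   /* exclusive, page aligned */
	unsigned long vm_pgoff; /* page offset corresponding to vm_start */
};

/* Index a folio mapped at addr is expected to carry for this VMA. */
static unsigned long model_page_index(const struct model_vma *vma,
				      unsigned long addr)
{
	assert(addr >= vma->vm_start && addr < vma->vm_end);
	return vma->vm_pgoff + ((addr - vma->vm_start) >> MODEL_PAGE_SHIFT);
}

static unsigned long model_vma_pages(const struct model_vma *vma)
{
	return (vma->vm_end - vma->vm_start) >> MODEL_PAGE_SHIFT;
}

/* Offsets line up when next continues exactly where prev's offsets end. */
static bool model_pgoff_mergeable(const struct model_vma *prev,
				  const struct model_vma *next)
{
	return prev->vm_end == next->vm_start &&
	       prev->vm_pgoff + model_vma_pages(prev) == next->vm_pgoff;
}

/*
 * Move a VMA to new_start. With relocate == false the old vm_pgoff is kept,
 * as for a faulted anon VMA today; with relocate == true vm_pgoff follows
 * the new address, which is what the flag is described as arranging.
 */
static struct model_vma model_move(struct model_vma vma,
				   unsigned long new_start, bool relocate)
{
	unsigned long len = vma.vm_end - vma.vm_start;

	vma.vm_start = new_start;
	vma.vm_end = new_start + len;
	if (relocate)
		vma.vm_pgoff = new_start >> MODEL_PAGE_SHIFT;
	return vma;
}
```

In this model, moving a VMA next to a neighbour without relocation leaves a vm_pgoff that no longer continues the neighbour's offsets, so model_pgoff_mergeable() reports false; with relocation the offsets line up, and any folio in the moved range would need its index rewritten to match model_page_index().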

>
> >
> > This flag falls back to the ordinary means of mremap() should the operation
> > not be feasible. It also transparently undoes the operation, carefully
> > holding rmap locks such that no racing rmap operation encounters incorrect
> > or missing VMAs.
>
> I absolutely dislike this undo operation, really. :(

Yes me too. It's a complete horror show.

>
> I hope we can find a way to just detect early whether this optimization
> would work.

Well, the problem is if we encounter something at the folio level right? If
something is unexpected, what then? No matter what we have to clean up our
mess.

We do try our best to ensure that things will succeed.

>
> Which are the exact error cases you can run into for un-doing?
>
> I assume:
>
> (a) cow-shared anon folio (can detect early)

Yes we should.

>
> (b) large folios crossing VMAs (TBD)

Addressed, see later patches in series.

>
> (c) KSM folios? Probably we could move them, I *think* we would have to
> update the ksm_rmap_item. Alternatively, we could indicate if a VMA had any
> KSM folios and give up early in the first version.
>
> (d) GUP pins: I think we could allow that ... folio_maybe_dma_pinned() is
> racy either way (GUP-fast!). To deal with GUP-fast we would have to play
> different games ...
>
> Anything else?

Well given the bug report in the thread, we also now have a failure to
obtain the folio lock because we hold PTE lock as a thing.

We could address that with lockless PTE traversal though.

Or we could do what we do in the folio_test_large() handling in
relocate_anon_pte() where we drop/reacquire...

We also have the case where, upon trying to split, we encounter a folio
which already has the currently locked anon_vma set. I can investigate
further how this can happen to determine if we can detect ahead of time.

Finally the folio split can fail...

I feel like we're on thin ice if we try to make an assumption that a
relocate can always succeed.

>
> >
> > In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
> > user needs to know whether or not the operation succeeded - this flag is
> > identical to MREMAP_RELOCATE_ANON, only if the operation cannot succeed,
> > the mremap() fails with -EFAULT.
>
> How would an APP deal with these errors? Do you have a user in mind that
> could do something sensible based on this error?

Well it's the only way to know if what you wanted actually happened or
not. I guarantee you, that people will complain if the issues they use this
to fix aren't always resolved by this.

They could also use it for some retry logic, potentially.
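
A rough sketch of what such fallback logic might look like in an application follows. Everything here is an assumption layered on the cover letter's description: move_try_relocate() and enum move_result are hypothetical names, and the numeric value of MREMAP_MUST_RELOCATE_ANON is deliberately left as a parameter, since it has to come from the patched include/uapi/linux/mman.h rather than be guessed. The only behaviour relied upon is that the strict flag fails with EFAULT when relocation is not possible, and that a kernel which does not know a flag bit rejects it with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

enum move_result {
	MOVE_FAILED = -1,   /* neither attempt worked, errno is set */
	MOVE_RELOCATED = 0, /* the strict attempt succeeded */
	MOVE_PLAIN = 1,     /* fell back to an ordinary move */
};

/*
 * Try a strict relocating move first, then fall back to a plain move.
 * must_relocate_flag is supplied by the caller (hypothetically the value
 * of MREMAP_MUST_RELOCATE_ANON from a patched uapi header).
 */
static enum move_result move_try_relocate(void *old, size_t len, void *target,
					  int must_relocate_flag, void **out)
{
	int base = MREMAP_MAYMOVE | MREMAP_FIXED;
	void *p;

	assert(out != NULL);
	p = mremap(old, len, len, base | must_relocate_flag, target);
	if (p != MAP_FAILED) {
		*out = p;
		return MOVE_RELOCATED;
	}
	/*
	 * EFAULT: per the cover letter, the relocation could not be done.
	 * EINVAL: most likely a kernel that does not know the flag.
	 * Anything else is treated as a real failure.
	 */
	if (errno != EFAULT && errno != EINVAL)
		return MOVE_FAILED;
	p = mremap(old, len, len, base, target);
	if (p == MAP_FAILED)
		return MOVE_FAILED;
	*out = p;
	return MOVE_PLAIN;
}
```

A caller that cares about fragmentation could log or count MOVE_PLAIN results, or retry the strict variant later, instead of silently accepting an unmergeable VMA.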

>
> I'm having a hard time imagining that :)

It's useful for testing at the very least, very useful indeed so on this
basis it's worth having and doesn't add too much complexity.

>
> >
> > Note that no-op mremap() operations (such as an unpopulated range, or a
> > merge that would trivially succeed already) will succeed under
> > MREMAP_MUST_RELOCATE_ANON.
> >
> > mremap() already walks page tables, so it isn't an order of magnitude
> > increase in workload, but constitutes the need to walk to page table leaf
> > level and manipulate folios.
>
> Only for anon VMAs, though. Do you have some numbers how bad it is? I mean,
> mremap() is already a pretty invasive/expensive operation ... :) ... which
> is why people started using uffdio_move instead, to avoid  the heavy-weight
> locks.

I got a whole bunch of numbers, I mean things were always within the same
order-of-magnitude, however things are much slower if the existing logic
could just move a higher order page table entry rather than having to
traverse folios, obviously.

I do feel that mremap() perf shouldn't be a consideration given how
heavy-handed it is already as you say. But I'm not sure everybody will
share that view...

>
> >
> > The operations all succeed under THP and in general are compatible with
> > underlying large folios of any size. In fact, the larger the folio, the
> > more efficient the operation is.
>
> Yes.
>
> >
> > Performance testing indicates that time taken using MREMAP_RELOCATE_ANON is
> > on the same order of magnitude as ordinary mremap() operations, with both
> > exhibiting time proportional to the portion of the mapping which is populated.
> >
> > Of course, mremap() operations that are entirely aligned are significantly
> > faster as they need only move a VMA and a smaller number of higher order
> > page tables, but this is unavoidable.
> >
> > Previous efforts in this area
> > =============================
> >
> > An approach addressing this issue was previously suggested by Jakub Matena
> > in a series posted a few years ago in [0] (and discussed in a masters
> > thesis).
> >
> > However this was a more general effort which attempted to always make
> > anonymous mappings more mergeable, and therefore was not quite ready for
> > the upstream limelight. In addition, large folio work which has occurred
> > since requires us to carefully consider and account for this.
> >
> > This series is more conservative and targeted (one must specify a flag to
> > get this behaviour) and additionally goes to great efforts to handle large
> > folios and account for all of the nitty gritty locking concerns that might
> > arise in current kernel code.
> >
> > Thanks goes out to Jakub for his efforts however, and hopefully this effort
> > to take a slightly different approach to the same problem is pleasing to
> > him regardless :)
> >
> > [0]:https://lore.kernel.org/all/20220311174602.288010-1-matenajakub@gmail.com/
> >
> > Use-cases
> > =========
> >
> > * ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
> >    upon which makes use of extensive mremap() operations to perform
> >    defragmentation of objects, taking advantage of the plentiful available
> >    virtual address space in a 64-bit system.
> >
> >    In instances where one VMA is faulted in and another not, merging is not
> >    possible, which leads to significant, unreclaimable, kernel metadata
> >    overhead and contention on the vm.max_map_count limit.
> >
> >    This series eliminates the issue entirely.
> > * It was indicated that Android similarly moves memory around and
> >    encounters the very same issues as ZGC.
>
> Isn't Android using uffdio_move?

I stated this only based on what I was told, I didn't dig deep.

>
> > * SUSE indicate they have encountered similar issues as pertains to an
> >    internal client.
> >
> > Past approaches
> > ===============
> >
> > In discussions at LSF/MM/BPF it was suggested that we could make this an
> > madvise() operation, however at this point it will be too late to correctly
> > perform the merge, requiring an unmap/remap which would be egregious.
> >
> > It was further suggested that we simply defer the operation to the point at
> > which an mremap() is attempted on multiple immediately adjacent VMAs (that
> > is - to allow VMA fragmentation up until the point where it might cause
> > perceptible issues with uAPI).
> >
> > This is problematic in that in the first instance - you accrue
> > fragmentation, and only if you were to try to move the fragmented objects
> > again would you resolve it.
> >
> > Additionally you would not be able to handle the mprotect() case, and you'd
> > have the same issue as the madvise() approach in that you'd need to
> > essentially re-map each VMA.
> >
> > Additionally it would become non-trivial to correctly merge the VMAs - if
> > there were more than 3, we would need to invent a new merging mechanism
> > specifically for this, hold locks carefully over each to avoid them
> > disappearing from beneath us and introduce a great deal of non-optional
> > complexity.
> >
> > While imperfect, the mremap flag approach seems the least invasive, most
> > workable solution (until further rework of the anon_vma mechanism can be
> > achieved!)
>
> Well, at that point we already have these new flags ... :(
>
> >
> >   include/linux/rmap.h                          |    4 +
> >   include/uapi/linux/mman.h                     |    8 +-
> >   mm/internal.h                                 |    1 +
> >   mm/mremap.c                                   |  719 ++++++-
> >   mm/vma.c                                      |   77 +-
> >   mm/vma.h                                      |   36 +-
>
> ~ +40% on LOC on mm/mremap.c :(

SLOC is a terrible measure :) I'd suggest counting how much of those are
comments... :)

The mremap() refactor added a bunch of SLOC but a lot of that was comments,
and breaking out very confusing logic into logical parts etc. It also added
more lines than that...

Unfortunately though trying to do anything like this involves added
complexity. I did try to keep it as minimal as possible...

>
> --
> Cheers,
>
> David / dhildenb
>