Message-ID: <cover.1745307301.git.lorenzo.stoakes@oracle.com>
Date: Tue, 22 Apr 2025 09:09:19 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Matthew Wilcox <willy@...radead.org>,
        David Hildenbrand <david@...hat.com>, Pedro Falcato <pfalcato@...e.de>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: [RFC PATCH v2 00/10] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON

A long-standing issue with VMA merging of anonymous VMAs is the requirement
to maintain both vma->vm_pgoff and anon_vma compatibility between merge
candidates.

For anonymous mappings, vma->vm_pgoff (and consequently, folio->index)
refer to virtual page offsets, that is, va >> PAGE_SHIFT.

However, upon mremap() of an anonymous mapping that has been faulted (that
is, where vma->anon_vma != NULL), we would need to walk page tables to be
able to access, let alone manipulate, the folio->index and folio->mapping
fields to permit an update of this virtual page offset.

Therefore in these instances, we do not do so, instead retaining the
virtual page offset the VMA was first faulted in at as its vma->vm_pgoff
field, and of course consequently folio->index.

On each occasion we use linear_page_index() to determine the appropriate
offset, which cleverly offsets the vma->vm_pgoff field by the difference
between the virtual address and the actual VMA start.
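
To make the arithmetic concrete, here is a minimal userspace sketch of the
calculation described above. It is not kernel code: struct sketch_vma,
SKETCH_PAGE_SHIFT (4 KiB pages assumed) and both helper functions are
hypothetical names used purely for illustration, and hugetlb is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the few vm_area_struct fields that matter here. */
struct sketch_vma {
	uint64_t vm_start; /* first virtual address covered by the VMA */
	uint64_t vm_end;   /* first virtual address past the VMA */
	uint64_t vm_pgoff; /* page offset recorded for vm_start */
};

/* Same arithmetic as described for linear_page_index() on an ordinary VMA. */
static uint64_t sketch_linear_page_index(const struct sketch_vma *vma,
					 uint64_t addr)
{
	assert(addr >= vma->vm_start && addr < vma->vm_end);
	return ((addr - vma->vm_start) >> SKETCH_PAGE_SHIFT) + vma->vm_pgoff;
}

/*
 * Model moving a faulted anonymous VMA without relocation: the address
 * range changes, but vm_pgoff keeps the value it had when first faulted.
 */
static struct sketch_vma sketch_move_faulted(struct sketch_vma vma,
					     uint64_t new_start)
{
	uint64_t len = vma.vm_end - vma.vm_start;

	vma.vm_start = new_start;
	vma.vm_end = new_start + len;
	return vma; /* vm_pgoff deliberately left untouched */
}
```

After such a move, the index computed for a page still matches the value
stored in folio->index, but vm_pgoff no longer equals
vm_start >> PAGE_SHIFT, which is what makes the VMA incompatible with its
new neighbours.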

Doing so in effect fragments the virtual address space, meaning that we are
no longer able to merge these VMAs with adjacent ones that could, at least
theoretically, be merged.

This also creates a difference in behaviour, often surprising to users,
between mappings which are faulted and those which are not - as for the
latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.

This is problematic firstly because this proliferates kernel allocations
that are pure memory pressure - unreclaimable and unmovable -
i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.

Secondly, mremap() exhibits an implicit uAPI in that it does not permit
remaps which span multiple VMAs (though it does permit remaps that
constitute a part of a single VMA).

This means that a user must concern themselves with whether merges succeed
or not should they wish to use mremap() in such a way that multiple
mremap() calls are performed upon mappings.
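
The restriction can be observed from userspace with a small sketch such as
the one below. The helper name sketch_remap_across_vmas() is hypothetical;
it splits an anonymous mapping into two VMAs via mprotect() and then tries
to grow the whole range. On kernels contemporary with this series the
expectation is a failure with EFAULT, as documented in the mremap(2) man
page, though this should be treated as an illustration rather than a
guarantee for every kernel version.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Try to grow a two-page range that is covered by two separate VMAs.
 * Returns 0 if the remap succeeded, the errno of the failed mremap()
 * otherwise, or -1 if the setup itself failed.
 */
static int sketch_remap_across_vmas(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t psz;
	char *p;
	void *res;
	int err;

	assert(page > 0);
	psz = (size_t)page;

	p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Differing protections force a split into two adjacent VMAs. */
	if (mprotect(p + psz, psz, PROT_READ) != 0) {
		munmap(p, 2 * psz);
		return -1;
	}

	/* The old range now spans both VMAs. */
	res = mremap(p, 2 * psz, 4 * psz, MREMAP_MAYMOVE);
	err = (res == MAP_FAILED) ? errno : 0;

	if (res == MAP_FAILED)
		munmap(p, 2 * psz);
	else
		munmap(res, 4 * psz);
	return err;
}
```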

This series provides users with an option to accept the overhead of
actually updating the VMA and underlying folios via the
MREMAP_RELOCATE_ANON flag.

If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
the mremap() succeeding, then no attempt is made at relocation of folios as
this is not required.

Even if no merge is possible upon moving of the region, vma->vm_pgoff and
folio->index fields are appropriately updated in order that subsequent
mremap() or mprotect() calls will succeed in merging.

This flag falls back to the ordinary means of mremap() should the operation
not be feasible. In that case it also transparently undoes the operation,
carefully holding rmap locks such that no racing rmap operation encounters
incorrect or missing VMAs.

In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
user needs to know whether or not the operation succeeded - this flag is
identical to MREMAP_RELOCATE_ANON, except that if the operation cannot
succeed, the mremap() fails with -EFAULT.

Note that no-op mremap() operations (such as an unpopulated range, or a
merge that would trivially succeed already) will succeed under
MREMAP_MUST_RELOCATE_ANON.
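
As a rough illustration of how a caller might use the flag, here is a
userspace sketch. The numeric value given to MREMAP_RELOCATE_ANON below is
a placeholder assumption (the real definition lives in the series'
include/uapi/linux/mman.h and is not stated in this cover letter), and
sketch_sys_mremap() and sketch_move_relocate() are hypothetical names. A
raw system call is used so that libc cannot filter the flags. On a kernel
that does not know the flag, the sketch assumes the call is rejected with
EINVAL and retries without it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * ASSUMPTION: placeholder value for illustration only. Use the definition
 * from the patched uapi header when building against this series.
 */
#ifndef MREMAP_RELOCATE_ANON
#define MREMAP_RELOCATE_ANON 8
#endif

/* Raw mremap() system call, bypassing any flag handling in libc. */
static void *sketch_sys_mremap(void *old_addr, size_t old_len, size_t new_len,
			       unsigned long flags, void *new_addr)
{
	return (void *)syscall(SYS_mremap, (unsigned long)old_addr, old_len,
			       new_len, flags, (unsigned long)new_addr);
}

/*
 * Move a mapping to new_addr, requesting relocation of anonymous folios.
 * If the kernel rejects the flag with EINVAL, retry as a plain move.
 * Returns the new address, or MAP_FAILED with errno set.
 */
static void *sketch_move_relocate(void *old_addr, size_t len, void *new_addr)
{
	unsigned long flags = MREMAP_MAYMOVE | MREMAP_FIXED;
	void *res;

	assert(len > 0);

	res = sketch_sys_mremap(old_addr, len, len,
				flags | MREMAP_RELOCATE_ANON, new_addr);
	if (res == MAP_FAILED && errno == EINVAL)
		res = sketch_sys_mremap(old_addr, len, len, flags, new_addr);
	return res;
}
```

A caller that must know whether relocation actually happened would pass
MREMAP_MUST_RELOCATE_ANON instead and treat a failure with EFAULT as
"relocation was not possible", rather than retrying silently as above.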

mremap() already walks page tables, so it isn't an order of magnitude
increase in workload, but it does require walking to page table leaf level
and manipulating folios.

The operations all succeed under THP and in general are compatible with
underlying large folios of any size. In fact, the larger the folio, the
more efficient the operation is.

Performance testing indicates that time taken using MREMAP_RELOCATE_ANON is
on the same order of magnitude as ordinary mremap() operations, with both
taking time proportional to the portion of the mapping which is populated.

Of course, mremap() operations that are entirely aligned are significantly
faster as they need only move a VMA and a smaller number of higher order
page tables, but this is unavoidable.

Use-cases:

* ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
  upon which makes use of extensive mremap() operations to perform
  defragmentation of objects, taking advantage of the plentiful available
  virtual address space in a 64-bit system.

  In instances where one VMA is faulted in and another not, merging is not
  possible, which leads to significant, unreclaimable, kernel metadata
  overhead and contention on the vm.max_map_count limit.

  This series eliminates the issue entirely.
* It was indicated that Android similarly moves memory around and
  encounters the very same issues as ZGC.
* SUSE indicate they have encountered similar issues as pertains to an
  internal client.

Alternative approaches:

In discussions at LSF/MM/BPF it was suggested that we could make this an
madvise() operation, however at this point it will be too late to correctly
perform the merge, requiring an unmap/remap which would be egregious.

It was further suggested that we simply defer the operation to the point at
which an mremap() is attempted on multiple immediately adjacent VMAs (that
is - to allow VMA fragmentation up until the point where it might cause
perceptible issues with uAPI).

This is problematic in that in the first instance - you accrue
fragmentation, and only if you were to try to move the fragmented objects
again would you resolve it.

Additionally you would not be able to handle the mprotect() case, and you'd
have the same issue as the madvise() approach in that you'd need to
essentially re-map each VMA.

Additionally it would become non-trivial to correctly merge the VMAs - if
there were more than 3, we would need to invent a new merging mechanism
specifically for this, hold locks carefully over each to avoid them
disappearing from beneath us and introduce a great deal of non-optional
complexity.

While imperfect, the mremap flag approach seems the least invasive, most
workable solution (until further rework of the anon_vma mechanism can be
achieved!)

Testing:

* Significantly expanded self-tests, all of which are passing.
* Ran all self tests with MREMAP_RELOCATE_ANON forced on for all anonymous
  mremap()'s.
* Ran heavy workloads with MREMAP_RELOCATE_ANON forced on on real hardware
  (kernel compilation, etc.)
* Ran stress-ng --mremap 32 for an hour with MREMAP_RELOCATE_ANON forced on
  on real hardware.

History:

RFC v2:
* Added folio_mapcount() check on relocate anon to assert exclusively
  mapped as per Jann.
* Added check for anon_vma->num_children > nr_pages in
  should_relocate_anon() as per Jann.
* Separated out vma_had_uncowed_parents() into shared helper function and
  added vma_had_uncowed_children() to implement the above.
* Add comment clarifying why we do not require an rmap lock on the old VMA
  due to fork requiring an mmap write lock which we hold.
* Corrected error path on __anon_vma_prepare() in copy_vma() as per Jann.
* Checked for folio pinning and abort if in place. We do so, because this
  implies the folio is being used by the kernel for a time longer than the
  time over which an mmap lock is held (which will not be held at the time
  of us manipulating the folio, as we hold the mmap write lock). We are
  manipulating mapping, index fields and being conservative (additionally
  mirroring what UFFDIO_MOVE does), we cannot assume that whoever holds the
  pin isn't somehow relying on these not being manipulated. As per David.
* Propagated mapcount, maybe DMA pinned checks to large folio logic.
* Added folio splitting - on second thoughts, it would be a bit silly to
  simply disallow the request because of large folio misalignment, work
  around this by splitting the folio in this instance.
* Added very careful handling around rmap lock, making use of
  folio_anon_vma(), to ensure we do not deadlock on anon_vma.
* Prefer vm_normal_folio() to vm_normal_page() & page_folio().
* Introduced has_shared_anon_vma() to de-duplicate shared anon_vma check.
* Provided sys_mremap() helper in vm_util.[ch] to be shared among test
  callers and de-duplicate. This must be a raw system call, as glibc will
  otherwise filter the flags.
* Expanded the mm CoW self-tests to explicitly test with
  MREMAP_RELOCATE_ANON for partial THP pages. This is useful as it
  exercises split_folio() code paths explicitly. Additionally some cases
  cannot succeed, so we also exercise undo paths.
* Added explicit lockdep handling to teach it that we are handling two
  distinct anon_vma locks so it doesn't spuriously report a deadlock.
* Updated anon_vma deadlock checks to check anon_vma->root. Shouldn't
  strictly be necessary as we explicitly limit ourselves to unforked
  anon_vma's, but it is more correct to do so, as this is where the lock is
  located.
* Expanded the split_huge_page_test.c test to also test using the
  MREMAP_RELOCATE_ANON flag, this is useful as it exercises the undo path.

RFC v1:
https://lore.kernel.org/all/cover.1742478846.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (10):
  mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
  mm/mremap: add MREMAP_MUST_RELOCATE_ANON
  mm/mremap: add MREMAP[_MUST]_RELOCATE_ANON support for large folios
  tools UAPI: Update copy of linux/mman.h from the kernel sources
  tools/testing/selftests: add sys_mremap() helper to vm_util.h
  tools/testing/selftests: add mremap() cases that merge normally
  tools/testing/selftests: add MREMAP_RELOCATE_ANON merge test cases
  tools/testing/selftests: expand mremap() tests for
    MREMAP_RELOCATE_ANON
  tools/testing/selftests: have CoW self test use MREMAP_RELOCATE_ANON
  tools/testing/selftests: test relocate anon in split huge page test

 include/uapi/linux/mman.h                     |    8 +-
 mm/internal.h                                 |    1 +
 mm/mremap.c                                   |  726 ++++++++-
 mm/vma.c                                      |   78 +-
 mm/vma.h                                      |   28 +-
 tools/include/uapi/linux/mman.h               |    8 +-
 tools/testing/selftests/mm/cow.c              |   23 +-
 tools/testing/selftests/mm/merge.c            | 1329 ++++++++++++++++-
 tools/testing/selftests/mm/mremap_test.c      |  262 ++--
 .../selftests/mm/split_huge_page_test.c       |   25 +-
 tools/testing/selftests/mm/vm_util.c          |    8 +
 tools/testing/selftests/mm/vm_util.h          |    3 +
 tools/testing/vma/vma.c                       |    5 +-
 tools/testing/vma/vma_internal.h              |   33 +
 14 files changed, 2363 insertions(+), 174 deletions(-)

--
2.49.0
