Message-ID: <dcdb3478-68ea-4d9f-af4f-2f5438de45d2@suse.cz>
Date: Fri, 11 Jul 2025 15:34:23 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Peter Xu <peterx@...hat.com>, Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>, Jann Horn <jannh@...gle.com>,
Pedro Falcato <pfalcato@...e.de>, Rik van Riel <riel@...riel.com>,
linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
Linux API <linux-api@...r.kernel.org>
Subject: Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
+cc linux-api - see the description of the new behavior below
On 7/11/25 13:38, Lorenzo Stoakes wrote:
> Historically we've made it a uAPI requirement that mremap() may only
> operate on a single VMA at a time.
>
> For instances where VMAs need to be resized, this makes sense, as it
> becomes very difficult to determine what a user actually wants should they
> indicate a desire to expand or shrink the size of multiple VMAs (truncate?
> Adjust sizes individually? Some other strategy?).
>
> However, in instances where a user is moving VMAs, it is restrictive to
> disallow this.
>
> This is especially the case for anonymous mappings, where a remapped
> region may or may not be mergeable depending on whether its VMAs have
> been faulted, due to anon_vma assignment and folio index alignment with
> vma->vm_pgoff.
>
> Often this results in surprising behaviour: a region is faulted, moved,
> then moved back, and the user fails to observe a merge of otherwise
> compatible, adjacent VMAs.
>
> This change allows such cases to work without the user having to be
> cognizant of whether prior mremap() moves or other VMA operations have
> resulted in VMA fragmentation.
>
> We only permit this for mremap() operations that do NOT change the size of
> the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
>
> Should no VMA exist in the range, -EFAULT is returned as usual.
>
> If a VMA move spans a single VMA - then there is no functional change.
>
> Otherwise, we place additional requirements upon VMAs:
>
> * They must not have a userfaultfd context associated with them - this
> requires dropping the lock to notify users, and we want to perform the
> operation with the mmap write lock held throughout.
>
> * If file-backed, they cannot have a custom get_unmapped_area handler -
> this might result in MREMAP_FIXED not being honoured, which could result
> in unexpected positioning of VMAs in the moved region.
>
> There may be gaps in the range of VMAs that are moved:
>
> X Y X Y
> <---> <-> <---> <->
> |-------| |-----| |-----| |-------| |-----| |-----|
> | A | | B | | C | ---> | A' | | B' | | C' |
> |-------| |-----| |-----| |-------| |-----| |-----|
> addr new_addr
>
> The move will preserve the gaps between each VMA.
AFAIU "moving a gap" doesn't mean we unmap anything pre-existing where the
moved gap's range falls to, right? Worth pointing out explicitly.
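For the linux-api audience, a minimal userspace sketch of the new
semantics as I read them (untested; error handling elided, and whether
pre-existing mappings under the moved gap survive is exactly the
question above):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);

	/* Source window: pages 0-1 and 3-5 mapped, page 2 unmapped, so
	 * [src, src + 6 * pg) spans two VMAs separated by a gap. */
	char *src = mmap(NULL, 6 * pg, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* Reserve a destination range, then release it for MREMAP_FIXED. */
	char *dst = mmap(NULL, 6 * pg, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	munmap(src + 2 * pg, pg);
	memset(src, 'x', 2 * pg);
	munmap(dst, 6 * pg);

	/*
	 * old_len == new_len with MREMAP_MAYMOVE | MREMAP_FIXED: both
	 * source VMAs (and the gap between them) are moved to dst in a
	 * single call. -EFAULT if no VMA exists in the source range, or
	 * if a VMA beyond the first fails the multi-VMA rules
	 * (uffd-armed, or custom get_unmapped_area()).
	 */
	return mremap(src, 6 * pg, 6 * pg,
		      MREMAP_MAYMOVE | MREMAP_FIXED, dst) == MAP_FAILED;
}

i.e. the gap at src + 2 * pg should be reproduced at dst + 2 * pg.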
> Note that any failures encountered will result in a partial move. Since an
> mremap() can fail at any time, this might result in only some of the VMAs
> being moved.
>
> Note that failures are very rare and typically require an out-of-memory
> condition or a mapping limit to be hit, assuming the VMAs being moved
> are valid.
>
> We don't try to assess ahead of time whether VMAs are valid according to
> the multi-VMA rules, as it would be rather unusual for a user to mix
> uffd-enabled VMAs and/or unusual driver mappings with custom
> get_unmapped_area() handlers in an aggregate operation.
>
> So we optimise for the far, far more likely case of the operation being
> entirely permissible.
Guess it's the sanest thing to do given all the circumstances.
> In the case of a move of a single VMA, the above restrictions do not
> apply. This keeps the behaviour for a single VMA identical to before.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Reviewed-by: Vlastimil Babka <vbabka@...e.cz>
Some nits:
> ---
> mm/mremap.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 150 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 8cb08ccea6ad..59f49de0f84e 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -69,6 +69,8 @@ struct vma_remap_struct {
> enum mremap_type remap_type; /* expand, shrink, etc. */
> bool mmap_locked; /* Is mm currently write-locked? */
> unsigned long charged; /* If VM_ACCOUNT, # pages to account. */
> + bool seen_vma; /* Is >1 VMA being moved? */
Seems this could be local variable of remap_move().
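i.e. something like this (untested sketch, assuming nothing outside
remap_move() reads the flag):

	static unsigned long remap_move(struct vma_remap_struct *vrm)
	{
		bool seen_vma = false;
		...

		for_each_vma_range(vmi, vma, end) {
			...
			/* No gap permitted at the start of the range. */
			if (!seen_vma && start < vma->vm_start)
				return -EFAULT;
			...
			seen_vma = true;
			target_addr = res_vma + vrm->new_len;
		}

		return res;
	}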
> + bool vmi_needs_reset; /* Was the VMA iterator invalidated? */
> };
>
> static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> @@ -1188,6 +1190,9 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> *new_vma_ptr = NULL;
> return -ENOMEM;
> }
A newline here?
> + if (vma != vrm->vma)
> + vrm->vmi_needs_reset = true;
A comment on what this condition means wouldn't hurt? Is it when "Source vma
may have been merged into new_vma" in copy_vma(), or when not?
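Perhaps something like this, if I'm reading copy_vma() right (comment
wording just a suggestion):

	/*
	 * copy_vma() may have merged the source VMA into an adjacent one,
	 * modifying the VMA tree; flag the iterator for a reset.
	 */
	if (vma != vrm->vma)
		vrm->vmi_needs_reset = true;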
(no comments below, remaining quoted part left for linux-api reference)
> +
> vrm->vma = vma;
> pmc.old = vma;
> pmc.new = new_vma;
> @@ -1583,6 +1588,18 @@ static bool vrm_will_map_new(struct vma_remap_struct *vrm)
> return false;
> }
>
> +/* Does this remap ONLY move mappings? */
> +static bool vrm_move_only(struct vma_remap_struct *vrm)
> +{
> + if (!(vrm->flags & MREMAP_FIXED))
> + return false;
> +
> + if (vrm->old_len != vrm->new_len)
> + return false;
> +
> + return true;
> +}
> +
> static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
> {
> struct mm_struct *mm = current->mm;
> @@ -1597,6 +1614,32 @@ static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
> userfaultfd_unmap_complete(mm, vrm->uf_unmap);
> }
>
> +static bool vma_multi_allowed(struct vm_area_struct *vma)
> +{
> + struct file *file;
> +
> + /*
> + * We can't support moving multiple uffd VMAs as notify requires
> + * mmap lock to be dropped.
> + */
> + if (userfaultfd_armed(vma))
> + return false;
> +
> + /*
> + * Custom get unmapped area might result in MREMAP_FIXED not
> + * being obeyed.
> + */
> + file = vma->vm_file;
> + if (file && !vma_is_shmem(vma) && !is_vm_hugetlb_page(vma)) {
> + const struct file_operations *fop = file->f_op;
> +
> + if (fop->get_unmapped_area)
> + return false;
> + }
> +
> + return true;
> +}
> +
> static int check_prep_vma(struct vma_remap_struct *vrm)
> {
> struct vm_area_struct *vma = vrm->vma;
> @@ -1646,7 +1689,19 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
> (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
> return -EINVAL;
>
> - /* We can't remap across vm area boundaries */
> + /*
> + * We can't remap across the end of VMAs, as another VMA may be
> + * adjacent:
> + *
> + * addr vma->vm_end
> + * |-----.----------|
> + * | . |
> + * |-----.----------|
> + * .<--------->xxx>
> + * old_len
> + *
> + * We also require that vma->vm_start <= addr < vma->vm_end.
> + */
> if (old_len > vma->vm_end - addr)
> return -EFAULT;
>
> @@ -1746,6 +1801,90 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
> return 0;
> }
>
> +static unsigned long remap_move(struct vma_remap_struct *vrm)
> +{
> + struct vm_area_struct *vma;
> + unsigned long start = vrm->addr;
> + unsigned long end = vrm->addr + vrm->old_len;
> + unsigned long new_addr = vrm->new_addr;
> + unsigned long target_addr = new_addr;
> + unsigned long res = -EFAULT;
> + unsigned long last_end;
> + bool allowed = true;
> + VMA_ITERATOR(vmi, current->mm, start);
> +
> + /*
> + * When moving VMAs we allow for batched moves across multiple VMAs,
> + * with all VMAs in the input range [addr, addr + old_len) being moved
> + * (and split as necessary).
> + */
> + for_each_vma_range(vmi, vma, end) {
> + /* Account for start, end not aligned with VMA start, end. */
> + unsigned long addr = max(vma->vm_start, start);
> + unsigned long len = min(end, vma->vm_end) - addr;
> + unsigned long offset, res_vma;
> +
> + if (!allowed)
> + return -EFAULT;
> +
> + /* No gap permitted at the start of the range. */
> + if (!vrm->seen_vma && start < vma->vm_start)
> + return -EFAULT;
> +
> + /*
> + * To sensibly move multiple VMAs, accounting for the fact that
> + * get_unmapped_area() may align even MAP_FIXED moves, we simply
> + * attempt to move such that the gaps between source VMAs remain
> + * consistent in destination VMAs, e.g.:
> + *
> + * X Y X Y
> + * <---> <-> <---> <->
> + * |-------| |-----| |-----| |-------| |-----| |-----|
> + * | A | | B | | C | ---> | A' | | B' | | C' |
> + * |-------| |-----| |-----| |-------| |-----| |-----|
> + * new_addr
> + *
> + * So we map B' at A'->vm_end + X, and C' at B'->vm_end + Y.
> + */
> + offset = vrm->seen_vma ? vma->vm_start - last_end : 0;
> + last_end = vma->vm_end;
> +
> + vrm->vma = vma;
> + vrm->addr = addr;
> + vrm->new_addr = target_addr + offset;
> + vrm->old_len = vrm->new_len = len;
> +
> + allowed = vma_multi_allowed(vma);
> + if (vrm->seen_vma && !allowed)
> + return -EFAULT;
> +
> + res_vma = check_prep_vma(vrm);
> + if (!res_vma)
> + res_vma = mremap_to(vrm);
> + if (IS_ERR_VALUE(res_vma))
> + return res_vma;
> +
> + if (!vrm->seen_vma) {
> + VM_WARN_ON_ONCE(allowed && res_vma != new_addr);
> + res = res_vma;
> + }
> +
> + /* mmap lock is only dropped on shrink. */
> + VM_WARN_ON_ONCE(!vrm->mmap_locked);
> + /* This is a move, no expand should occur. */
> + VM_WARN_ON_ONCE(vrm->populate_expand);
> +
> + if (vrm->vmi_needs_reset) {
> + vma_iter_reset(&vmi);
> + vrm->vmi_needs_reset = false;
> + }
> + vrm->seen_vma = true;
> + target_addr = res_vma + vrm->new_len;
> + }
> +
> + return res;
> +}
> +
> static unsigned long do_mremap(struct vma_remap_struct *vrm)
> {
> struct mm_struct *mm = current->mm;
> @@ -1763,13 +1902,17 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
> return -EINTR;
> vrm->mmap_locked = true;
>
> - vrm->vma = vma_lookup(current->mm, vrm->addr);
> - res = check_prep_vma(vrm);
> - if (res)
> - goto out;
> + if (vrm_move_only(vrm)) {
> + res = remap_move(vrm);
> + } else {
> + vrm->vma = vma_lookup(current->mm, vrm->addr);
> + res = check_prep_vma(vrm);
> + if (res)
> + goto out;
>
> - /* Actually execute mremap. */
> - res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> + /* Actually execute mremap. */
> + res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> + }
>
> out:
> failed = IS_ERR_VALUE(res);