linux-kernel - Re: [PATCH 09/10] mm/mremap: permit mremap() move of multiple VMAs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <vn7djxc5cnadmqxsxtd7frgx346fnsvfvlzyh253dxnwn3pe7f@kqci2ing4coj>
Date: Wed, 9 Jul 2025 14:13:41 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Peter Xu <peterx@...hat.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Pedro Falcato <pfalcato@...e.de>, Rik van Riel <riel@...riel.com>,
        linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH 09/10] mm/mremap: permit mremap() move of multiple VMAs

* Lorenzo Stoakes <lorenzo.stoakes@...cle.com> [250707 01:28]:
> Historically we've made it a uAPI requirement that mremap() may only
> operate on a single VMA at a time.
> 
> For instances where VMAs need to be resized, this makes sense, as it
> becomes very difficult to determine what a user actually wants should they
> indicate a desire to expand or shrink the size of multiple VMAs (truncate?
> Adjust sizes individually? Some other strategy?).
> 
> However, in instances where a user is moving VMAs, it is restrictive to
> disallow this.
> 
> This is especially the case when anonymous mapping remap may or may not be
> mergeable depending on whether VMAs have or have not been faulted due to
> anon_vma assignment and folio index alignment with vma->vm_pgoff.
> 
> Often this can result in surprising impact where a moved region is faulted,
> then moved back and a user fails to observe a merge from otherwise
> compatible, adjacent VMAs.
> 
> This change allows such cases to work without the user having to be
> cognizant of whether a prior mremap() move or other VMA operations has
> resulted in VMA fragmentation.
> 
> Having refactored mremap code to aggregate per-VMA and parameter checks, we
> are now in a position to permit this kind of move.
> 
> We do so by detecting if this is a move-only operation up-front, and then
> utilising a separate code path via remap_move() rather than the ordinary
> single-VMA path.
> 
> There are two tasks that occur outside of the mmap write lock - userfaultfd
> notification and population of unmapped regions of expanded VMAs should the
> VMA be mlock()'d.
> 
> The latter doesn't apply, as this is logic for a move only and thus no
> expansion can take place. In the former case, we explicitly disallow
> multi-VMA operations on uffd-armed VMAs.
> 
> The mmap lock is never dropped in the move-only case, this only occurs on a
> VMA shrink.
> 
> We take care to handle cases where a VMA merge has occurred, by resetting
> the VMA iterator in such instances.
> 
> We needn't worry about self-merges, as in those cases we would, by
> definition, not be spanning multiple VMAs. The overlapping range test is
> performed on the whole range so specifically disallows this.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> ---
>  mm/mremap.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 99 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 28e776cddc08..2e6005e1d22c 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -69,6 +69,8 @@ struct vma_remap_struct {
>  	enum mremap_type remap_type;	/* expand, shrink, etc. */
>  	bool mmap_locked;		/* Is mm currently write-locked? */
>  	unsigned long charged;		/* If VM_ACCOUNT, # pages to account. */
> +	bool multi_vma;			/* Is >1 VMA being moved? */
> +	bool vma_reset;			/* Was the VMA merged/unmap occur? */

The name doesn't read well in code.  vmi_reset or reset_iter might be
better, but I don't really mind it like this.

>  };
>  
>  static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> @@ -1111,6 +1113,7 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
>  
>  	err = do_vmi_munmap(&vmi, mm, addr, len, vrm->uf_unmap, /* unlock= */false);
>  	vrm->vma = NULL; /* Invalidated. */
> +	vrm->vma_reset = true;

I believe the munmap() operation leaves the vmi in the correct position
to reuse, so this is cautious that costs an extra walk of the tree.  I
don't think it's critical to performance, but if it is we can look here.
It would have to be passed through which might be a pain.

>  	if (err) {
>  		/* OOM: unable to split vma, just get accounts right */
>  		vm_acct_memory(len >> PAGE_SHIFT);
> @@ -1181,6 +1184,7 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
>  
>  	new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff,
>  			   &pmc.need_rmap_locks);
> +	vrm->vma_reset = vma != vrm->vma;
>  	if (!new_vma) {
>  		vrm_uncharge(vrm);
>  		*new_vma_ptr = NULL;
> @@ -1325,6 +1329,7 @@ static unsigned long shrink_vma(struct vma_remap_struct *vrm,
>  	res = do_vmi_munmap(&vmi, mm, unmap_start, unmap_bytes,
>  			    vrm->uf_unmap, drop_lock);
>  	vrm->vma = NULL; /* Invalidated. */
> +	vrm->vma_reset = true;

Ditto here, lock depending..

>  	if (res)
>  		return res;
>  
> @@ -1362,6 +1367,7 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm)
>  		err = do_munmap(mm, vrm->new_addr, vrm->new_len,
>  				vrm->uf_unmap_early);
>  		vrm->vma = NULL; /* Invalidated. */
> +		vrm->vma_reset = true;

Pretty sure this one is needed, regardless of passing through (and
updating this call).

>  		if (err)
>  			return err;
>  
> @@ -1581,6 +1587,18 @@ static bool vrm_will_map_new(struct vma_remap_struct *vrm)
>  	return false;
>  }
>  
> +/* Does this remap ONLY move mappings? */
> +static bool vrm_move_only(struct vma_remap_struct *vrm)
> +{
> +	if (!vrm_implies_new_addr(vrm))
> +		return false;
> +
> +	if (vrm->old_len != vrm->new_len)
> +		return false;
> +
> +	return true;
> +}
> +
>  static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
>  {
>  	struct mm_struct *mm = current->mm;
> @@ -1644,10 +1662,29 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
>  			(vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
>  		return -EINVAL;
>  
> -	/* We can't remap across vm area boundaries */
> +	/*
> +	 * We can't remap across the end of VMAs, as another VMA may be
> +	 * adjacent:
> +	 *
> +	 *       addr   vma->vm_end
> +	 *  |-----.----------|
> +	 *  |     .          |
> +	 *  |-----.----------|
> +	 *        .<--------->xxx>
> +	 *            old_len
> +	 *
> +	 * We also require that vma->vm_start <= addr < vma->vm_end.
> +	 */
>  	if (old_len > vma->vm_end - addr)
>  		return -EFAULT;
>  
> +	/*
> +	 * We can't support moving multiple uffd VMAs as notify requires mmap
> +	 * lock to be dropped.
> +	 */
> +	if (vrm->multi_vma && userfaultfd_armed(vma))
> +		return -EINVAL;
> +
>  	if (new_len <= old_len)
>  		return 0;
>  
> @@ -1744,6 +1781,57 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
>  	return 0;
>  }
>  
> +static unsigned long remap_move(struct vma_remap_struct *vrm)
> +{
> +	struct vm_area_struct *vma;
> +	unsigned long start = vrm->addr;
> +	unsigned long end = vrm->addr + vrm->old_len;
> +	unsigned long new_addr = vrm->new_addr;
> +	unsigned long prev_addr = start;
> +	VMA_ITERATOR(vmi, current->mm, start);
> +
> +	/*
> +	 * When moving VMAs we allow for batched moves across multiple VMAs,
> +	 * with all VMAs in the input range [addr, addr + old_len) being moved
> +	 * (and split as necessary).
> +	 */
> +	for_each_vma_range(vmi, vma, end) {
> +		unsigned long addr = max(vma->vm_start, start);
> +		unsigned long len = min(end, vma->vm_end) - addr;
> +		unsigned long offset = addr - start;
> +		unsigned long res;
> +
> +		/* Merged with self, move on. */
> +		if (vrm->multi_vma && prev_addr == addr)
> +			continue;
> +
> +		vrm->vma = vma;
> +		vrm->addr = addr;
> +		vrm->new_addr = new_addr + offset;
> +		vrm->old_len = vrm->new_len = len;
> +
> +		res = check_prep_vma(vrm);
> +		if (!res)
> +			res = mremap_to(vrm);
> +		if (IS_ERR_VALUE(res))
> +			return res;
> +
> +		/* mmap lock is only dropped on shrink. */
> +		VM_WARN_ON_ONCE(!vrm->mmap_locked);
> +		/* This is a move, no expand should occur. */
> +		VM_WARN_ON_ONCE(vrm->populate_expand);
> +
> +		if (vrm->vma_reset) {
> +			vma_iter_reset(&vmi);
> +			vrm->vma_reset = false;
> +		}

What code path results in vma_reset == false here?

> +		vrm->multi_vma = true;
> +		prev_addr = addr;
> +	}
> +
> +	return new_addr;
> +}

The iterator use looks good.

> +
>  static unsigned long do_mremap(struct vma_remap_struct *vrm)
>  {
>  	struct mm_struct *mm = current->mm;
> @@ -1761,13 +1849,17 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm)
>  		return -EINTR;
>  	vrm->mmap_locked = true;
>  
> -	vrm->vma = vma_lookup(current->mm, vrm->addr);
> -	res = check_prep_vma(vrm);
> -	if (res)
> -		goto out;
> +	if (vrm_move_only(vrm)) {
> +		res = remap_move(vrm);
> +	} else {
> +		vrm->vma = vma_lookup(current->mm, vrm->addr);
> +		res = check_prep_vma(vrm);
> +		if (res)
> +			goto out;
>  
> -	/* Actually execute mremap. */
> -	res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> +		/* Actually execute mremap. */
> +		res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> +	}
>  
>  out:
>  	failed = IS_ERR_VALUE(res);
> -- 
> 2.50.0
>