linux-kernel - Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f24dd244-f188-4804-981c-8b7560e5a26b@redhat.com>
Date: Mon, 16 Jun 2025 22:58:28 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Andrew Morton <akpm@...ux-foundation.org>
Cc: Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Suren Baghdasaryan <surenb@...gle.com>, Matthew Wilcox
 <willy@...radead.org>, Pedro Falcato <pfalcato@...e.de>,
 Rik van Riel <riel@...riel.com>, Harry Yoo <harry.yoo@...cle.com>,
 Zi Yan <ziy@...dia.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
 Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
 Dev Jain <dev.jain@....com>, Jakub Matena <matenajakub@...il.com>,
 Wei Yang <richard.weiyang@...il.com>, Barry Song <baohua@...nel.org>,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via
 MREMAP_RELOCATE_ANON

On 09.06.25 15:26, Lorenzo Stoakes wrote:
> When mremap() moves a mapping around in memory, it goes to great lengths to
> avoid having to walk page tables as this is expensive and
> time-consuming.
> 
> Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> well all the folio indexes pointed at the associated anon_vma object.
> 
> This means the VMA and page tables can simply be moved and this affects the
> change (and if we can move page tables at a higher page table level, this
> is even faster).
> 
> While this is efficient, it does lead to big problems with VMA merging - in
> essence it causes faulted anonymous VMAs to not be mergeable under many
> circumstances once moved.
> 
> This is limiting and leads to both a proliferation of unreclaimable,
> unmovable kernel metadata (VMAs, anon_vma's, anon_vma_chain's) and has an
> impact on further use of mremap(), which has a requirement that the VMA
> moved (which can also be a partial range within a VMA) may span only a
> single VMA.
> 
> This makes the mergeability or not of VMAs in effect a uAPI concern.
> 
> In some use cases, users may wish to accept the overhead of actually going
> to the trouble of updating VMAs and folios to affect mremap() moves. Let's
> provide them with the choice.
> 
> This patch add a new MREMAP_RELOCATE_ANON flag to do just that, which
> attempts to perform such an operation. If it is unable to do so, it cleanly
> falls back to the usual method.
> 
> It carefully takes the rmap locks such that at no time will a racing rmap
> user encounter incorrect or missing VMAs.
> 
> It is also designed to interact cleanly with the existing mremap() error
> fallback mechanism (inverting the remap should the page table move fail).
> 
> Also, if we could merge cleanly without such a change, we do so, avoiding
> the overhead of the operation if it is not required.
> 
> In the instance that no merge may occur when the move is performed, we
> still perform the folio and VMA updates to ensure that future mremap() or
> mprotect() calls will result in merges.
> 
> In this implementation, we simply give up if we encounter large folios. A
> subsequent commit will extend the functionality to allow for these cases.
> 
> We restrict this flag to purely anonymous memory only.
> 
> we separate out the vma_had_uncowed_parents() helper function for checking
> in should_relocate_anon() and introduce a new function
> vma_maybe_has_shared_anon_folios() which combines a check against this and
> any forked child anon_vma's.
> 
> We carefully check for pinned folios in case a caller who holds a pin might
> make assumptions about index, mapping fields which we are about to
> manipulate.

Som quick feedback, I did not yet digest everything.

[...]

> +/*
> + * If the folio mapped at the specified pte entry can have its index and mapping
> + * relocated, then do so.
> + *
> + * Returns the number of pages we have traversed, or 0 if the operation failed.
> + */
> +static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
> +		struct pte_state *state, bool undo)
> +{
> +	struct folio *folio;
> +	struct vm_area_struct *old, *new;
> +	pgoff_t new_index;
> +	pte_t pte;
> +	unsigned long ret = 1;
> +	unsigned long old_addr = state->old_addr;
> +	unsigned long new_addr = state->new_addr;
> +
> +	old = pmc->old;
> +	new = pmc->new;
> +
> +	pte = ptep_get(state->ptep);
> +
> +	/* Ensure we have truly got an anon folio. */
> +	folio = vm_normal_folio(old, old_addr, pte);
> +	if (!folio)
> +		return ret;
> +
> +	folio_lock(folio);
> +
> +	/* No-op. */
> +	if (!folio_test_anon(folio) || folio_test_ksm(folio))
> +		goto out;
> +

So these cases are all "pass".

 > +	/*> +	 * This should never be the case as we have already checked 
to ensure
> +	 * that the anon_vma is not forked, and we have just asserted that it is
> +	 * anonymous.
> +	 */
> +	if (WARN_ON_ONCE(folio_maybe_mapped_shared(folio)))
> +		goto out;

Good a warning, so we should be able to handle that early.

> +	/* The above check should imply these. */
> +	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
 > +	VM_WARN_ON_ONCE(!PageAnonExclusive(folio_page(folio, 0)));

This can trigger in one nasty case, where we can lose the PAE bit during 
swapin (refault from the swapcache while the folio is under writeback, 
and the device does not allow for modifying the data while under writeback).

> +
> +	/*
> +	 * A pinned folio implies that it will be used for a duration longer
> +	 * than that over which the mmap_lock is held, meaning that another part
> +	 * of the kernel may be making use of this folio.
> +	 *
> +	 * Since we are about to manipulate index & mapping fields, we cannot
> +	 * safely proceed because whatever has pinned this folio may then
> +	 * incorrectly assume these do not change.
> +	 */
> +	if (folio_maybe_dma_pinned(folio))
> +		goto out;

As discussed, this can race with GUP-fast. SO *maybe* we can just allow 
for moving these.

(after all we still have ordinary GUP that would also not be covered by 
this check)

> +
> +	/*
> +	 * This should not happen as we explicitly disallow this, but check
> +	 * anyway.
> +	 */
> +	if (folio_test_large(folio)) {
> +		ret = 0;
> +		goto out;
> +	}

That is the only real problem for rollback so far I assume.

> +
> +	if (!undo)
> +		new_index = linear_page_index(new, new_addr);
> +	else
> +		new_index = linear_page_index(old, old_addr);
> +
> +	/*
> +	 * The PTL should keep us safe from unmapping, and the fact the folio is
> +	 * a PTE keeps the folio referenced.
> +	 *
> +	 * The mmap/VMA locks should keep us safe from fork and other processes.
> +	 *
> +	 * The rmap locks should keep us safe from anything happening to the
> +	 * VMA/anon_vma.
> +	 *
> +	 * The folio lock should keep us safe from reclaim, migration, etc.
> +	 */
> +	folio_move_anon_rmap(folio, undo ? old : new);
> +	WRITE_ONCE(folio->index, new_index);
> +
> +out:
> +	folio_unlock(folio);
> +	return ret;
> +}
> +
> +static bool pte_done(struct pte_state *state)
> +{
> +	return state->old_addr >= state->old_end;
> +}
> +
> +static void pte_next(struct pte_state *state, unsigned long nr_pages)
> +{
> +	state->old_addr += nr_pages * PAGE_SIZE;
> +	state->new_addr += nr_pages * PAGE_SIZE;
> +	state->ptep += nr_pages;
> +}
> +
> +static bool relocate_anon_ptes(struct pagetable_move_control *pmc,
> +		unsigned long extent, pmd_t *pmdp, bool undo)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct pte_state state = {
> +		.old_addr = pmc->old_addr,
> +		.new_addr = pmc->new_addr,
> +		.old_end = pmc->old_addr + extent,
> +	};
> +	pte_t *ptep_start;
> +	bool ret;
> +	unsigned long nr_pages;
> +
> +	ptep_start = pte_offset_map_lock(mm, pmdp, pmc->old_addr, &state.ptl);
> +	/*
> +	 * We prevent faults with mmap write lock, hold the rmap lock and should
> +	 * not fail to obtain this lock. Just give up if we can't.
> +	 */
> +	if (!ptep_start)
> +		return false;
> +
> +	state.ptep = ptep_start;
> +	for (; !pte_done(&state); pte_next(&state, nr_pages)) {
> +		pte_t pte = ptep_get(state.ptep);
> +
> +		if (pte_none(pte) || !pte_present(pte)) {
> +			nr_pages = 1;

What if we have

(a) A migration entry (possibly we might fail migration and simply remap 
the original folio)

(b) A swap entry with a folio in the swapcache that we can refault.

I don't think we can simply skip these ...

> +			continue;
> +		}
> +
> +		nr_pages = relocate_anon_pte(pmc, &state, undo);
> +		if (!nr_pages) {
> +			ret = false;
> +			goto out;
> +		}
> +	}
> +
> +	ret = true;
> +out:
> +	pte_unmap_unlock(ptep_start, state.ptl);
> +	return ret;
> +}
> +
> +static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
> +{
> +	pud_t *pudp;
> +	pmd_t *pmdp;
> +	unsigned long extent;
> +	struct mm_struct *mm = current->mm;
> +
> +	if (!pmc->len_in)
> +		return true;
> +
> +	for (; !pmc_done(pmc); pmc_next(pmc, extent)) {
> +		pmd_t pmd;
> +		pud_t pud;
> +
> +		extent = get_extent(NORMAL_PUD, pmc);
> +
> +		pudp = get_old_pud(mm, pmc->old_addr);
> +		if (!pudp)
> +			continue;
> +		pud = pudp_get(pudp);
> +
> +		if (pud_trans_huge(pud) || pud_devmap(pud))
> +			return false;

We don't support PUD-size THP, why to we have to fail here?

> +
> +		extent = get_extent(NORMAL_PMD, pmc);
> +		pmdp = get_old_pmd(mm, pmc->old_addr);
> +		if (!pmdp)
> +			continue;
> +		pmd = pmdp_get(pmdp);
> +
> +		if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) ||
> +		    pmd_devmap(pmd))
> +			return false;

Okay, this case could likely be handled later (present anon folio or 
migration entry; everything else, we can skip).

> +
> +		if (pmd_none(pmd))
> +			continue;
> +
> +		if (!relocate_anon_ptes(pmc, extent, pmdp, undo))
> +			return false;
> +	}
 > +> +	return true;
> +}
> +
> +static bool relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
> +{
> +	unsigned long old_addr = pmc->old_addr;
> +	unsigned long new_addr = pmc->new_addr;
> +	bool ret;
> +
> +	ret = __relocate_anon_folios(pmc, undo);
> +
> +	/* Reset state ready for retry. */
> +	pmc->old_addr = old_addr;
> +	pmc->new_addr = new_addr;
> +
> +	return ret;
> +}
> +
>   unsigned long move_page_tables(struct pagetable_move_control *pmc)
>   {
>   	unsigned long extent;
> @@ -1134,6 +1380,67 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
>   	}
>   }
>   
> +/*
> + * Should we attempt to relocate anonymous folios to the location that the VMA
> + * is being moved to by updating index and mapping fields accordingly?
> + */
> +static bool should_relocate_anon(struct vma_remap_struct *vrm,
> +	struct pagetable_move_control *pmc)
> +{
> +	struct vm_area_struct *old = vrm->vma;
> +
> +	/* Currently we only do this if requested. */
> +	if (!(vrm->flags & MREMAP_RELOCATE_ANON))
> +		return false;
> +
> +	/* We can't deal with special or hugetlb mappings. */
> +	if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB))
> +		return false;
> +
> +	/* We only support anonymous mappings. */
> +	if (!vma_is_anonymous(old))
> +		return false;

I suspect MAP_PRIVATE file mappings should be easy to extend?

[...]

>   	pmc.new = new_vma;
>   
> +	if (relocate_anon) {
> +		lock_new_anon_vma(new_vma);
> +		pmc.relocate_locked = new_vma;
> +
> +		if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> +			unsigned long start = new_vma->vm_start;
> +			unsigned long size = new_vma->vm_end - start;
> +
> +			/* Undo if fails. */
> +			relocate_anon_folios(&pmc, /* undo= */true);

You'd assume this cannot fail, but I think it can: imagine concurrent 
GUP-fast ...

I really wish we can find a way to not require the fallback.

-- 
Cheers,

David / dhildenb