Message-ID: <d51af1de-110a-4cde-9091-98e15367dda3@lucifer.local>
Date: Tue, 17 Jun 2025 11:07:21 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Matthew Wilcox <willy@...radead.org>, Pedro Falcato <pfalcato@...e.de>,
Rik van Riel <riel@...riel.com>, Harry Yoo <harry.yoo@...cle.com>,
Zi Yan <ziy@...dia.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
Dev Jain <dev.jain@....com>, Jakub Matena <matenajakub@...il.com>,
Wei Yang <richard.weiyang@...il.com>, Barry Song <baohua@...nel.org>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via
MREMAP_RELOCATE_ANON
On Mon, Jun 16, 2025 at 10:58:28PM +0200, David Hildenbrand wrote:
> On 09.06.25 15:26, Lorenzo Stoakes wrote:
> > When mremap() moves a mapping around in memory, it goes to great lengths to
> > avoid having to walk page tables as this is expensive and
> > time-consuming.
> >
> > Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> > page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> > will all the folio indexes pointing at the associated anon_vma object.
> >
> > This means the VMA and page tables can simply be moved and this effects the
> > change (and if we can move page tables at a higher page table level, this
> > is even faster).
> >
> > While this is efficient, it does lead to big problems with VMA merging - in
> > essence it causes faulted anonymous VMAs to not be mergeable under many
> > circumstances once moved.
> >
> > This is limiting - it both leads to a proliferation of unreclaimable,
> > unmovable kernel metadata (VMAs, anon_vmas, anon_vma_chains) and has an
> > impact on further use of mremap(), which requires that the range being
> > moved (which can also be a partial range within a VMA) span only a
> > single VMA.
> >
> > This makes whether or not VMAs can be merged, in effect, a uAPI concern.
> >
> > In some use cases, users may wish to accept the overhead of actually going
> > to the trouble of updating VMAs and folios to effect mremap() moves. Let's
> > provide them with the choice.
> >
> > This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which
> > attempts to perform such an operation. If it is unable to do so, it cleanly
> > falls back to the usual method.
> >
> > It carefully takes the rmap locks such that at no time will a racing rmap
> > user encounter incorrect or missing VMAs.
> >
> > It is also designed to interact cleanly with the existing mremap() error
> > fallback mechanism (inverting the remap should the page table move fail).
> >
> > Also, if we could merge cleanly without such a change, we do so, avoiding
> > the overhead of the operation if it is not required.
> >
> > In the instance that no merge may occur when the move is performed, we
> > still perform the folio and VMA updates to ensure that future mremap() or
> > mprotect() calls will result in merges.
> >
> > In this implementation, we simply give up if we encounter large folios. A
> > subsequent commit will extend the functionality to allow for these cases.
> >
> > We restrict this flag to purely anonymous memory only.
> >
> > We separate out the vma_had_uncowed_parents() helper function for checking
> > in should_relocate_anon(), and introduce a new function
> > vma_maybe_has_shared_anon_folios() which combines a check against this and
> > any forked child anon_vmas.
> >
> > We carefully check for pinned folios in case a caller who holds a pin might
> > make assumptions about the index and mapping fields which we are about to
> > manipulate.
>
> Some quick feedback; I did not yet digest everything.
Thanks for taking a look! Appreciated :)
>
> [...]
>
> > +/*
> > + * If the folio mapped at the specified pte entry can have its index and mapping
> > + * relocated, then do so.
> > + *
> > + * Returns the number of pages we have traversed, or 0 if the operation failed.
> > + */
> > +static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
> > + struct pte_state *state, bool undo)
> > +{
> > + struct folio *folio;
> > + struct vm_area_struct *old, *new;
> > + pgoff_t new_index;
> > + pte_t pte;
> > + unsigned long ret = 1;
> > + unsigned long old_addr = state->old_addr;
> > + unsigned long new_addr = state->new_addr;
> > +
> > + old = pmc->old;
> > + new = pmc->new;
> > +
> > + pte = ptep_get(state->ptep);
> > +
> > + /* Ensure we have truly got an anon folio. */
> > + folio = vm_normal_folio(old, old_addr, pte);
> > + if (!folio)
> > + return ret;
> > +
> > + folio_lock(folio);
> > +
> > + /* No-op. */
> > + if (!folio_test_anon(folio) || folio_test_ksm(folio))
> > + goto out;
> > +
>
> So these cases are all "pass".
Yeah, this is maybe not entirely clear, but it's more like 'we don't need to do
anything with these'.
Really we shouldn't be encountering non-anon folios here given we've checked the
VMA, but if we somehow do, there's nothing to do.
>
> > + /*
> > + * This should never be the case as we have already checked to ensure
> > + * that the anon_vma is not forked, and we have just asserted that it is
> > + * anonymous.
> > + */
> > + if (WARN_ON_ONCE(folio_maybe_mapped_shared(folio)))
> > + goto out;
>
> Good, a warning, so we should be able to handle that early.
:)
>
> > + /* The above check should imply these. */
> > + VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
> > + VM_WARN_ON_ONCE(!PageAnonExclusive(folio_page(folio, 0)));
>
> This can trigger in one nasty case, where we can lose the PAE bit during
> swapin (refault from the swapcache while the folio is under writeback, and
> the device does not allow for modifying the data while under writeback).
Ugh, I wasn't aware of that. So maybe drop this second one?
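i.e. something like the below (minimal sketch, just dropping the PAE assertion
and documenting why - assuming I've understood the refault-under-writeback case
correctly):

	/* The above check should imply this. */
	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));

	/*
	 * Deliberately no PageAnonExclusive() assertion here - the PAE bit
	 * can legitimately be lost on refault from the swapcache while the
	 * folio is under writeback, if the device does not allow modifying
	 * data under writeback.
	 */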
>
> > +
> > + /*
> > + * A pinned folio implies that it will be used for a duration longer
> > + * than that over which the mmap_lock is held, meaning that another part
> > + * of the kernel may be making use of this folio.
> > + *
> > + * Since we are about to manipulate index & mapping fields, we cannot
> > + * safely proceed because whatever has pinned this folio may then
> > + * incorrectly assume these do not change.
> > + */
> > + if (folio_maybe_dma_pinned(folio))
> > + goto out;
>
> As discussed, this can race with GUP-fast. So *maybe* we can just allow for
> moving these.
I'm guessing you mean as discussed below? :P Or in the cover letter I've not
read yet? :P
Yeah, to be honest you shouldn't be fiddling with index, mapping anyway except
via rmap logic.
I will audit access of these fields just to be safe.
>
> (after all we still have ordinary GUP that would also not be covered by this
> check)
>
> > +
> > + /*
> > + * This should not happen as we explicitly disallow this, but check
> > + * anyway.
> > + */
> > + if (folio_test_large(folio)) {
> > + ret = 0;
> > + goto out;
> > + }
>
> That is the only real problem for rollback so far I assume.
Well, this becomes ultimately irrelevant in a later patch where we indeed
support large folios.
>
> > +
> > + if (!undo)
> > + new_index = linear_page_index(new, new_addr);
> > + else
> > + new_index = linear_page_index(old, old_addr);
> > +
> > + /*
> > + * The PTL should keep us safe from unmapping, and the fact the folio is
> > + * a PTE keeps the folio referenced.
> > + *
> > + * The mmap/VMA locks should keep us safe from fork and other processes.
> > + *
> > + * The rmap locks should keep us safe from anything happening to the
> > + * VMA/anon_vma.
> > + *
> > + * The folio lock should keep us safe from reclaim, migration, etc.
> > + */
> > + folio_move_anon_rmap(folio, undo ? old : new);
> > + WRITE_ONCE(folio->index, new_index);
> > +
> > +out:
> > + folio_unlock(folio);
> > + return ret;
> > +}
> > +
> > +static bool pte_done(struct pte_state *state)
> > +{
> > + return state->old_addr >= state->old_end;
> > +}
> > +
> > +static void pte_next(struct pte_state *state, unsigned long nr_pages)
> > +{
> > + state->old_addr += nr_pages * PAGE_SIZE;
> > + state->new_addr += nr_pages * PAGE_SIZE;
> > + state->ptep += nr_pages;
> > +}
> > +
> > +static bool relocate_anon_ptes(struct pagetable_move_control *pmc,
> > + unsigned long extent, pmd_t *pmdp, bool undo)
> > +{
> > + struct mm_struct *mm = current->mm;
> > + struct pte_state state = {
> > + .old_addr = pmc->old_addr,
> > + .new_addr = pmc->new_addr,
> > + .old_end = pmc->old_addr + extent,
> > + };
> > + pte_t *ptep_start;
> > + bool ret;
> > + unsigned long nr_pages;
> > +
> > + ptep_start = pte_offset_map_lock(mm, pmdp, pmc->old_addr, &state.ptl);
> > + /*
> > + * We prevent faults with mmap write lock, hold the rmap lock and should
> > + * not fail to obtain this lock. Just give up if we can't.
> > + */
> > + if (!ptep_start)
> > + return false;
> > +
> > + state.ptep = ptep_start;
> > + for (; !pte_done(&state); pte_next(&state, nr_pages)) {
> > + pte_t pte = ptep_get(state.ptep);
> > +
> > + if (pte_none(pte) || !pte_present(pte)) {
> > + nr_pages = 1;
>
> What if we have
>
> (a) A migration entry (possibly we might fail migration and simply remap the
> original folio)
>
> (b) A swap entry with a folio in the swapcache that we can refault.
>
> I don't think we can simply skip these ...
Good point... will investigate these cases.
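Perhaps, as a conservative first pass, something like the below in
relocate_anon_ptes() (rough, untested sketch) - only skip genuinely empty
entries and refuse to relocate if we hit anything non-present rather than
silently skipping it; we could later special-case things like waiting on
migration entries or checking the swapcache:

		if (pte_none(pte)) {
			nr_pages = 1;
			continue;
		}

		if (!pte_present(pte)) {
			/*
			 * A migration entry, or a swap entry whose folio may
			 * still be sitting in the swapcache, can resurface a
			 * folio carrying a stale anon_vma/index, so give up
			 * rather than skip.
			 */
			ret = false;
			goto out;
		}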
>
> > + continue;
> > + }
> > +
> > + nr_pages = relocate_anon_pte(pmc, &state, undo);
> > + if (!nr_pages) {
> > + ret = false;
> > + goto out;
> > + }
> > + }
> > +
> > + ret = true;
> > +out:
> > + pte_unmap_unlock(ptep_start, state.ptl);
> > + return ret;
> > +}
> > +
> > +static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
> > +{
> > + pud_t *pudp;
> > + pmd_t *pmdp;
> > + unsigned long extent;
> > + struct mm_struct *mm = current->mm;
> > +
> > + if (!pmc->len_in)
> > + return true;
> > +
> > + for (; !pmc_done(pmc); pmc_next(pmc, extent)) {
> > + pmd_t pmd;
> > + pud_t pud;
> > +
> > + extent = get_extent(NORMAL_PUD, pmc);
> > +
> > + pudp = get_old_pud(mm, pmc->old_addr);
> > + if (!pudp)
> > + continue;
> > + pud = pudp_get(pudp);
> > +
> > + if (pud_trans_huge(pud) || pud_devmap(pud))
> > + return false;
>
> We don't support PUD-size THP, why to we have to fail here?
This is just to be in line with other 'magical future where we have PUD THP'
stuff in mremap.c.
A later commit that permits huge folio support actually lets us support these...
>
> > +
> > + extent = get_extent(NORMAL_PMD, pmc);
> > + pmdp = get_old_pmd(mm, pmc->old_addr);
> > + if (!pmdp)
> > + continue;
> > + pmd = pmdp_get(pmdp);
> > +
> > + if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) ||
> > + pmd_devmap(pmd))
> > + return false;
>
> Okay, this case could likely be handled later (present anon folio or
> migration entry; everything else, we can skip).
Hmm, but how? The PMD cannot be traversed in this case?
'Present' migration entry? Migration entries are non-present, right? :) Or is it
different at the PMD level?
>
> > +
> > + if (pmd_none(pmd))
> > + continue;
> > +
> > + if (!relocate_anon_ptes(pmc, extent, pmdp, undo))
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +
> > +static bool relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
> > +{
> > + unsigned long old_addr = pmc->old_addr;
> > + unsigned long new_addr = pmc->new_addr;
> > + bool ret;
> > +
> > + ret = __relocate_anon_folios(pmc, undo);
> > +
> > + /* Reset state ready for retry. */
> > + pmc->old_addr = old_addr;
> > + pmc->new_addr = new_addr;
> > +
> > + return ret;
> > +}
> > +
> > unsigned long move_page_tables(struct pagetable_move_control *pmc)
> > {
> > unsigned long extent;
> > @@ -1134,6 +1380,67 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
> > }
> > }
> > +/*
> > + * Should we attempt to relocate anonymous folios to the location that the VMA
> > + * is being moved to by updating index and mapping fields accordingly?
> > + */
> > +static bool should_relocate_anon(struct vma_remap_struct *vrm,
> > + struct pagetable_move_control *pmc)
> > +{
> > + struct vm_area_struct *old = vrm->vma;
> > +
> > + /* Currently we only do this if requested. */
> > + if (!(vrm->flags & MREMAP_RELOCATE_ANON))
> > + return false;
> > +
> > + /* We can't deal with special or hugetlb mappings. */
> > + if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB))
> > + return false;
> > +
> > + /* We only support anonymous mappings. */
> > + if (!vma_is_anonymous(old))
> > + return false;
>
> I suspect MAP_PRIVATE file mappings should be easy to extend?
Yeah, but perhaps best to be conservative at first.
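(If/when we do extend it, I'd imagine the anon-only check in
should_relocate_anon() relaxes to roughly the below untested sketch -
permitting private file-backed mappings, whose COW'd anonymous folios are all
we would actually relocate, but never anything shared.)

	/*
	 * Anonymous, or private file-backed (only its COW'd anon folios
	 * would be relocated) - never anything shared.
	 */
	if (!vma_is_anonymous(old) &&
	    !(old->vm_file && !(old->vm_flags & VM_SHARED)))
		return false;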
>
> [...]
>
> > pmc.new = new_vma;
> > + if (relocate_anon) {
> > + lock_new_anon_vma(new_vma);
> > + pmc.relocate_locked = new_vma;
> > +
> > + if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> > + unsigned long start = new_vma->vm_start;
> > + unsigned long size = new_vma->vm_end - start;
> > +
> > + /* Undo if fails. */
> > + relocate_anon_folios(&pmc, /* undo= */true);
>
> You'd assume this cannot fail, but I think it can: imagine concurrent
> GUP-fast ...
Well, if we change the racy code to ignore DMA-pinned folios we should be OK, right?
>
> I really wish we could find a way to not require the fallback.
Yeah, the fallback is horrible, but we really do need it. See the page table move
fallback code for nightmares also :)
We could also alternatively:
- Have some kind of anon_vma fragmentation where some folios in range reference
a different anon_vma that we link to the original VMA (quite possibly very
broken though).
- Keep track of folios somehow and separate them from the page table walk (but
then we risk races).
- Have some way of telling the kernel that such a situation exists via a new
object that can be pointed to by folio->mapping and that the rmap code
recognises - essentially an 'anon_vma migration entry' which can fail.
I already considered combining this operation with the page table move
operation, but the locking gets horrible, the undo is categorically much worse,
and I'm not sure it's actually workable.
>
> --
> Cheers,
>
> David / dhildenb
>