Message-ID: <d51af1de-110a-4cde-9091-98e15367dda3@lucifer.local>
Date: Tue, 17 Jun 2025 11:07:21 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Matthew Wilcox <willy@...radead.org>, Pedro Falcato <pfalcato@...e.de>,
Rik van Riel <riel@...riel.com>, Harry Yoo <harry.yoo@...cle.com>,
Zi Yan <ziy@...dia.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
Dev Jain <dev.jain@....com>, Jakub Matena <matenajakub@...il.com>,
Wei Yang <richard.weiyang@...il.com>, Barry Song <baohua@...nel.org>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 01/11] mm/mremap: introduce more mergeable mremap via
MREMAP_RELOCATE_ANON
On Mon, Jun 16, 2025 at 10:58:28PM +0200, David Hildenbrand wrote:
> On 09.06.25 15:26, Lorenzo Stoakes wrote:
> > When mremap() moves a mapping around in memory, it goes to great lengths to
> > avoid having to walk page tables as this is expensive and
> > time-consuming.
> >
> > Rather, if the VMA was faulted (that is vma->anon_vma != NULL), the virtual
> > page offset stored in the VMA at vma->vm_pgoff will remain the same, as
> > will all the folio indexes pointing at the associated anon_vma object.
> >
> > This means the VMA and page tables can simply be moved and this effects the
> > change (and if we can move page tables at a higher page table level, this
> > is even faster).
> >
> > While this is efficient, it does lead to big problems with VMA merging - in
> > essence it causes faulted anonymous VMAs to not be mergeable under many
> > circumstances once moved.
> >
> > This is limiting - it both leads to a proliferation of unreclaimable,
> > unmovable kernel metadata (VMAs, anon_vmas, anon_vma_chains) and has an
> > impact on further use of mremap(), which requires that the range being
> > moved (which can also be a partial range within a VMA) span only a
> > single VMA.
> >
> > This makes whether or not VMAs can be merged, in effect, a uAPI concern.
> >
> > In some use cases, users may wish to accept the overhead of actually going
> > to the trouble of updating VMAs and folios to effect mremap() moves. Let's
> > provide them with the choice.
> >
> > This patch adds a new MREMAP_RELOCATE_ANON flag to do just that, which
> > attempts to perform such an operation. If it is unable to do so, it cleanly
> > falls back to the usual method.
> >
> > It carefully takes the rmap locks such that at no time will a racing rmap
> > user encounter incorrect or missing VMAs.
> >
> > It is also designed to interact cleanly with the existing mremap() error
> > fallback mechanism (inverting the remap should the page table move fail).
> >
> > Also, if we could merge cleanly without such a change, we do so, avoiding
> > the overhead of the operation if it is not required.
> >
> > In the instance that no merge may occur when the move is performed, we
> > still perform the folio and VMA updates to ensure that future mremap() or
> > mprotect() calls will result in merges.
> >
> > In this implementation, we simply give up if we encounter large folios. A
> > subsequent commit will extend the functionality to allow for these cases.
> >
> > We restrict this flag to purely anonymous memory only.
> >
> > We separate out the vma_had_uncowed_parents() helper function for checking
> > in should_relocate_anon(), and introduce a new function
> > vma_maybe_has_shared_anon_folios() which combines a check against this and
> > any forked child anon_vmas.
> >
> > We carefully check for pinned folios in case a caller who holds a pin might
> > make assumptions about the index and mapping fields which we are about to
> > manipulate.
>
> Some quick feedback; I did not yet digest everything.
Thanks for taking a look! Appreciated :)
>
> [...]
>
> > +/*
> > + * If the folio mapped at the specified pte entry can have its index and mapping
> > + * relocated, then do so.
> > + *
> > + * Returns the number of pages we have traversed, or 0 if the operation failed.
> > + */
> > +static unsigned long relocate_anon_pte(struct pagetable_move_control *pmc,
> > + struct pte_state *state, bool undo)
> > +{
> > + struct folio *folio;
> > + struct vm_area_struct *old, *new;
> > + pgoff_t new_index;
> > + pte_t pte;
> > + unsigned long ret = 1;
> > + unsigned long old_addr = state->old_addr;
> > + unsigned long new_addr = state->new_addr;
> > +
> > + old = pmc->old;
> > + new = pmc->new;
> > +
> > + pte = ptep_get(state->ptep);
> > +
> > + /* Ensure we have truly got an anon folio. */
> > + folio = vm_normal_folio(old, old_addr, pte);
> > + if (!folio)
> > + return ret;
> > +
> > + folio_lock(folio);
> > +
> > + /* No-op. */
> > + if (!folio_test_anon(folio) || folio_test_ksm(folio))
> > + goto out;
> > +
>
> So these cases are all "pass".
Yeah, this is maybe not entirely clear, but it's more like 'we don't need to do
anything with these'.
Really we shouldn't be encountering non-anon folios here given we've checked the
VMA, but if we somehow do, there's nothing to do.
>
> > + /*
> > + * This should never be the case as we have already checked to ensure
> > + * that the anon_vma is not forked, and we have just asserted that it is
> > + * anonymous.
> > + */
> > + if (WARN_ON_ONCE(folio_maybe_mapped_shared(folio)))
> > + goto out;
>
> Good, a warning, so we should be able to handle that early.
:)
>
> > + /* The above check should imply these. */
> > + VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
> > + VM_WARN_ON_ONCE(!PageAnonExclusive(folio_page(folio, 0)));
>
> This can trigger in one nasty case, where we can lose the PAE bit during
> swapin (refault from the swapcache while the folio is under writeback, and
> the device does not allow for modifying the data while under writeback).
Ugh, I wasn't aware of that. So maybe drop this second one?
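i.e. something like the below (minimal sketch, just dropping the PAE assertion
and documenting why - assuming I've understood the refault-under-writeback case
correctly):

	/* The above check should imply this. */
	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));

	/*
	 * Deliberately no PageAnonExclusive() assertion here - the PAE bit
	 * can legitimately be lost on refault from the swapcache while the
	 * folio is under writeback, if the device does not allow modifying
	 * data under writeback.
	 */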
>
> > +
> > + /*
> > + * A pinned folio implies that it will be used for a duration longer
> > + * than that over which the mmap_lock is held, meaning that another part
> > + * of the kernel may be making use of this folio.
> > + *
> > + * Since we are about to manipulate index & mapping fields, we cannot
> > + * safely proceed because whatever has pinned this folio may then
> > + * incorrectly assume these do not change.
> > + */
> > + if (folio_maybe_dma_pinned(folio))
> > + goto out;
>
> As discussed, this can race with GUP-fast. So *maybe* we can just allow for
> moving these.
I'm guessing you mean as discussed below? :P Or in the cover letter I've not
read yet? :P
Yeah, to be honest you shouldn't be fiddling with index, mapping anyway except
via rmap logic.
I will audit access of these fields just to be safe.
>
> (after all we still have ordinary GUP that would also not be covered by this
> check)
>
> > +
> > + /*
> > + * This should not happen as we explicitly disallow this, but check
> > + * anyway.
> > + */
> > + if (folio_test_large(folio)) {
> > + ret = 0;
> > + goto out;
> > + }
>
> That is the only real problem for rollback so far I assume.
Well, this becomes ultimately irrelevant in a later patch where we indeed
support large folios.
>
> > +
> > + if (!undo)
> > + new_index = linear_page_index(new, new_addr);
> > + else
> > + new_index = linear_page_index(old, old_addr);
> > +
> > + /*
> > + * The PTL should keep us safe from unmapping, and the fact the folio is
> > + * a PTE keeps the folio referenced.
> > + *
> > + * The mmap/VMA locks should keep us safe from fork and other processes.
> > + *
> > + * The rmap locks should keep us safe from anything happening to the
> > + * VMA/anon_vma.
> > + *
> > + * The folio lock should keep us safe from reclaim, migration, etc.
> > + */
> > + folio_move_anon_rmap(folio, undo ? old : new);
> > + WRITE_ONCE(folio->index, new_index);
> > +
> > +out:
> > + folio_unlock(folio);
> > + return ret;
> > +}
> > +
> > +static bool pte_done(struct pte_state *state)
> > +{
> > + return state->old_addr >= state->old_end;
> > +}
> > +
> > +static void pte_next(struct pte_state *state, unsigned long nr_pages)
> > +{
> > + state->old_addr += nr_pages * PAGE_SIZE;
> > + state->new_addr += nr_pages * PAGE_SIZE;
> > + state->ptep += nr_pages;
> > +}
> > +
> > +static bool relocate_anon_ptes(struct pagetable_move_control *pmc,
> > + unsigned long extent, pmd_t *pmdp, bool undo)
> > +{
> > + struct mm_struct *mm = current->mm;
> > + struct pte_state state = {
> > + .old_addr = pmc->old_addr,
> > + .new_addr = pmc->new_addr,
> > + .old_end = pmc->old_addr + extent,
> > + };
> > + pte_t *ptep_start;
> > + bool ret;
> > + unsigned long nr_pages;
> > +
> > + ptep_start = pte_offset_map_lock(mm, pmdp, pmc->old_addr, &state.ptl);
> > + /*
> > + * We prevent faults with mmap write lock, hold the rmap lock and should
> > + * not fail to obtain this lock. Just give up if we can't.
> > + */
> > + if (!ptep_start)
> > + return false;
> > +
> > + state.ptep = ptep_start;
> > + for (; !pte_done(&state); pte_next(&state, nr_pages)) {
> > + pte_t pte = ptep_get(state.ptep);
> > +
> > + if (pte_none(pte) || !pte_present(pte)) {
> > + nr_pages = 1;
>
> What if we have
>
> (a) A migration entry (possibly we might fail migration and simply remap the
> original folio)
>
> (b) A swap entry with a folio in the swapcache that we can refault.
>
> I don't think we can simply skip these ...
Good point... will investigate these cases.
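Perhaps, as a conservative first pass, something like the below in
relocate_anon_ptes() (rough, untested sketch) - only skip genuinely empty
entries and refuse to relocate if we hit anything non-present rather than
silently skipping it; we could later special-case things like waiting on
migration entries or checking the swapcache:

		if (pte_none(pte)) {
			nr_pages = 1;
			continue;
		}

		if (!pte_present(pte)) {
			/*
			 * A migration entry, or a swap entry whose folio may
			 * still be sitting in the swapcache, can resurface a
			 * folio carrying a stale anon_vma/index, so give up
			 * rather than skip.
			 */
			ret = false;
			goto out;
		}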
>
> > + continue;
> > + }
> > +
> > + nr_pages = relocate_anon_pte(pmc, &state, undo);
> > + if (!nr_pages) {
> > + ret = false;
> > + goto out;
> > + }
> > + }
> > +
> > + ret = true;
> > +out:
> > + pte_unmap_unlock(ptep_start, state.ptl);
> > + return ret;
> > +}
> > +
> > +static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
> > +{
> > + pud_t *pudp;
> > + pmd_t *pmdp;
> > + unsigned long extent;
> > + struct mm_struct *mm = current->mm;
> > +
> > + if (!pmc->len_in)
> > + return true;
> > +
> > + for (; !pmc_done(pmc); pmc_next(pmc, extent)) {
> > + pmd_t pmd;
> > + pud_t pud;
> > +
> > + extent = get_extent(NORMAL_PUD, pmc);
> > +
> > + pudp = get_old_pud(mm, pmc->old_addr);
> > + if (!pudp)
> > + continue;
> > + pud = pudp_get(pudp);
> > +
> > + if (pud_trans_huge(pud) || pud_devmap(pud))
> > + return false;
>
> We don't support PUD-size THP, why to we have to fail here?
This is just to be in line with other 'magical future where we have PUD THP'
stuff in mremap.c.
A later commit that permits huge folio support actually lets us support these...
>
> > +
> > + extent = get_extent(NORMAL_PMD, pmc);
> > + pmdp = get_old_pmd(mm, pmc->old_addr);
> > + if (!pmdp)
> > + continue;
> > + pmd = pmdp_get(pmdp);
> > +
> > + if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) ||
> > + pmd_devmap(pmd))
> > + return false;
>
> Okay, this case could likely be handled later (present anon folio or
> migration entry; everything else, we can skip).
Hmm, but how? The PMD cannot be traversed in this case?
'Present' migration entry? Migration entries are non-present, right? :) Or is it
different at the PMD level?
>
> > +
> > + if (pmd_none(pmd))
> > + continue;
> > +
> > + if (!relocate_anon_ptes(pmc, extent, pmdp, undo))
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +
> > +static bool relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
> > +{
> > + unsigned long old_addr = pmc->old_addr;
> > + unsigned long new_addr = pmc->new_addr;
> > + bool ret;
> > +
> > + ret = __relocate_anon_folios(pmc, undo);
> > +
> > + /* Reset state ready for retry. */
> > + pmc->old_addr = old_addr;
> > + pmc->new_addr = new_addr;
> > +
> > + return ret;
> > +}
> > +
> > unsigned long move_page_tables(struct pagetable_move_control *pmc)
> > {
> > unsigned long extent;
> > @@ -1134,6 +1380,67 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
> > }
> > }
> > +/*
> > + * Should we attempt to relocate anonymous folios to the location that the VMA
> > + * is being moved to by updating index and mapping fields accordingly?
> > + */
> > +static bool should_relocate_anon(struct vma_remap_struct *vrm,
> > + struct pagetable_move_control *pmc)
> > +{
> > + struct vm_area_struct *old = vrm->vma;
> > +
> > + /* Currently we only do this if requested. */
> > + if (!(vrm->flags & MREMAP_RELOCATE_ANON))
> > + return false;
> > +
> > + /* We can't deal with special or hugetlb mappings. */
> > + if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB))
> > + return false;
> > +
> > + /* We only support anonymous mappings. */
> > + if (!vma_is_anonymous(old))
> > + return false;
>
> I suspect MAP_PRIVATE file mappings should be easy to extend?
Yeah, but perhaps best to be conservative at first.
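(If/when we do extend it, I'd imagine the anon-only check in
should_relocate_anon() relaxes to roughly the below untested sketch -
permitting private file-backed mappings, whose COW'd anonymous folios are all
we would actually relocate, but never anything shared.)

	/*
	 * Anonymous, or private file-backed (only its COW'd anon folios
	 * would be relocated) - never anything shared.
	 */
	if (!vma_is_anonymous(old) &&
	    !(old->vm_file && !(old->vm_flags & VM_SHARED)))
		return false;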
>
> [...]
>
> > pmc.new = new_vma;
> > + if (relocate_anon) {
> > + lock_new_anon_vma(new_vma);
> > + pmc.relocate_locked = new_vma;
> > +
> > + if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> > + unsigned long start = new_vma->vm_start;
> > + unsigned long size = new_vma->vm_end - start;
> > +
> > + /* Undo if fails. */
> > + relocate_anon_folios(&pmc, /* undo= */true);
>
> You'd assume this cannot fail, but I think it can: imagine concurrent
> GUP-fast ...
Well, if we change the racy code to ignore DMA-pinned folios we should be OK, right?
>
> I really wish we could find a way to not require the fallback.
Yeah, the fallback is horrible, but we really do need it. See the page table move
fallback code for nightmares also :)
We could also alternatively:
- Have some kind of anon_vma fragmentation where some folios in range reference
a different anon_vma that we link to the original VMA (quite possibly very
broken though).
- Keep track of folios somehow and separate them from the page table walk (but
then we risk races).
- Have some way of telling the kernel that such a situation exists via a new
object that can be pointed to by folio->mapping and that the rmap code
recognises - essentially an 'anon_vma migration entry' which can fail.
I already considered combining this operation with the page table move
operation, but the locking gets horrible, the undo is categorically much worse,
and I'm not sure it's actually workable.
>
> --
> Cheers,
>
> David / dhildenb
>