Message-ID: <de385044-e9a2-45a1-b74f-68ba80bba146@lucifer.local>
Date: Tue, 19 Aug 2025 21:33:59 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>
Cc: David Hildenbrand <david@...hat.com>, maple-tree@...ts.infradead.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Jann Horn <jannh@...gle.com>, Pedro Falcato <pfalcato@...e.de>,
Charan Teja Kalla <quic_charante@...cinc.com>,
shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com,
bhe@...hat.com, baohua@...nel.org, chrisl@...nel.org,
Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC PATCH 6/6] mm: Change dup_mmap() recovery
On Fri, Aug 15, 2025 at 03:10:31PM -0400, Liam R. Howlett wrote:
> When the dup_mmap() fails during the vma duplication or setup, don't
> write the XA_ZERO entry in the vma tree. Instead, destroy the tree and
> free the new resources, leaving an empty vma tree.
Yesss like it!
>
> Using XA_ZERO introduced races where the vma could be found between
> dup_mmap() dropping all locks and exit_mmap() taking the locks. The
> race can occur because the mm can be reached through the other trees
> via successfully copied vmas and other methods such as the swapoff code.
Yeah god.
>
> XA_ZERO was marking the location to stop vma removal and pagetable
> freeing. The newly created arguments to the unmap_vmas() and
> free_pgtables() serve this function.
Nice.
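(Restating for my own benefit - with the bound now passed in by the caller,
the walk in e.g. free_pgtables() goes from

	next = mas_find(mas, tree_max - 1);
	if (unlikely(xa_is_zero(next)))
		next = NULL;

to simply

	next = mas_find(mas, tree_max - 1);

with no sentinel left to special-case, which is exactly what the hunks below
do.)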
>
> Replacing the XA_ZERO entry use with the new argument list also means
> the checks for xa_is_zero() are no longer necessary so these are also
> removed.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@...cle.com>
Ah finally the 'action' patch :)
Obviously my review is on the basis that you fixup the rebase etc.
This broadly looks right, but there are various nits etc. below, and I'll have
another look through on the new revision - this whole area is pretty crazy!
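For my own notes, the recovery in dup_mmap() boils down to picking how far the
new tree needs tearing down (paraphrasing the hunk below, not actual code):

	/* how far into the new tree do we need to tear down? */
	if (!mm->map_count)
		max = 0;		/* nothing was copied, nothing to unmap */
	else if (mpnt)
		max = mpnt->vm_start;	/* failed part-way, unmap up to the failure point */
	else
		max = ULONG_MAX;	/* everything was copied, tear it all down */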
> ---
> mm/memory.c | 6 +-----
> mm/mmap.c | 39 ++++++++++++++++++++++++++++-----------
> 2 files changed, 29 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3346514562bba..8cd58f171bae4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -387,8 +387,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> * be 0. This will underflow and is okay.
> */
> next = mas_find(mas, tree_max - 1);
> - if (unlikely(xa_is_zero(next)))
> - next = NULL;
>
> /*
> * Hide vma from rmap and truncate_pagecache before freeing
> @@ -407,8 +405,6 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
> vma = next;
> next = mas_find(mas, tree_max - 1);
> - if (unlikely(xa_is_zero(next)))
> - next = NULL;
> if (mm_wr_locked)
> vma_start_write(vma);
> unlink_anon_vmas(vma);
> @@ -1978,7 +1974,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> mm_wr_locked);
> hugetlb_zap_end(vma, &details);
> vma = mas_find(mas, tree_end - 1);
> - } while (vma && likely(!xa_is_zero(vma)));
> + } while (vma);
> mmu_notifier_invalidate_range_end(&range);
> }
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index eba2bc81bc749..5acc0b5f8c14a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1288,7 +1288,7 @@ void exit_mmap(struct mm_struct *mm)
> arch_exit_mmap(mm);
>
> vma = vma_next(&vmi);
> - if (!vma || unlikely(xa_is_zero(vma))) {
> + if (!vma) {
> /* Can happen if dup_mmap() received an OOM */
> mmap_read_unlock(mm);
> mmap_write_lock(mm);
> @@ -1858,20 +1858,37 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> ksm_fork(mm, oldmm);
> khugepaged_fork(mm, oldmm);
> } else {
> + unsigned long max;
>
> /*
> - * The entire maple tree has already been duplicated. If the
> - * mmap duplication fails, mark the failure point with
> - * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
> - * stop releasing VMAs that have not been duplicated after this
> - * point.
> + * The entire maple tree has already been duplicated, but
> + * replacing the vmas failed at mpnt (which could be NULL if
> + * all were allocated but the last vma was not fully set up. Use
Missing ')'.
> + * the start address of the failure point to clean up the half
> + * initialized tree.
NIT: Is 'half' correct? Maybe 'partially'?
> */
> - if (mpnt) {
> - mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> - mas_store(&vmi.mas, XA_ZERO_ENTRY);
> - /* Avoid OOM iterating a broken tree */
> - set_bit(MMF_OOM_SKIP, &mm->flags);
> + if (!mm->map_count) {
> + /* zero vmas were written to the new tree. */
> + max = 0;
OK, I guess this will then result in the intentional underflow thing - maybe
worth mentioning?
> + } else if (mpnt) {
> + /* mid-tree failure */
Partial?
> + max = mpnt->vm_start;
> + } else {
> + /* All vmas were written to the new tree */
> + max = ULONG_MAX;
> }
> +
> + /* Hide mm from oom killer because the memory is being freed */
> + set_bit(MMF_OOM_SKIP, &mm->flags);
Obv. update to mm_flags_set(MMF_OOM_SKIP, mm);
> + if (max) {
> + vma_iter_set(&vmi, 0);
> + tmp = vma_next(&vmi);
> + flush_cache_mm(mm);
> + unmap_region(&vmi.mas, tmp, 0, max, max, NULL, NULL);
(An aside for the unmap_region() impl, maybe let's name the pg_max as
tree_max to make it consistent across both?)
Hm we could still use the mmap_state struct here if we put it in vma.h. Just
have to set vmi, and obviously leave prev, next NULL.
So:
struct mmap_state map = {
	.vmi = &vmi,
};

unmap_region(&map, tmp, 0, max, max);
Possibly overkill, and hopefully the stack usage is ok, but it makes other
callers less horrid.
Maybe also good to add a comment spelling out what each of these params do.
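Something like this is what I have in mind (purely illustrative, and I'm
guessing at what the middle three parameters mean):

	unmap_region(&map,	/* mmap state with vmi set, prev/next left NULL */
		     tmp,	/* first vma to unmap */
		     0,		/* unmap from the bottom of the address space */
		     max,	/* up to the failure point */
		     max);	/* and bound the tree/pagetable walk the same way */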
> + charge = tear_down_vmas(mm, &vmi, tmp, max);
> + vm_unacct_memory(charge);
> + }
> + __mt_destroy(&mm->mm_mt);
> /*
> * The mm_struct is going to exit, but the locks will be dropped
> * first. Set the mm_struct as unstable is advisable as it is
> --
> 2.47.2
>