[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <nnuncvxj3p7zszgojgst4z5dv3mn3xkfymty33x3rwzopr4ecv@mev6cvnkr2wy>
Date: Mon, 11 Aug 2025 11:48:27 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Charan Teja Kalla <quic_charante@...cinc.com>,
akpm@...ux-foundation.org, shikemeng@...weicloud.com,
kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
baohua@...nel.org, chrisl@...nel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH] mm: swap: check for xa_zero_entry() on vma in swapoff
path
* David Hildenbrand <david@...hat.com> [250811 11:39]:
> > >
> > > I think it may actually be difficult to do on some level or there was some
> > > reason we couldn't, but I may be mistaken.
> >
>
> Thanks for the information!
>
> > Down the rabbit hole we go..
> >
> > The cloning of the tree happens by copying the tree in DFS and
> > replacing the old nodes with new nodes. The tree leaves end up being
> > copied, which contains all the vmas (unless DONT_COPY is set, so
> > basically always all of them..). When the tree is copied, we have a
> > duplicate of the tree with pointers to all the vmas in the old process.
> >
> > The way the tree fails is that we've been unable to finish cloning it,
> > usually for out of memory reasons. So, this means we have a tree with
> > new and exciting vmas that have never been used and old but still active
> > vmas in oldmm.
> >
> > The failure point is then marked with an XA_ZERO_ENTRY, which will
> > succeed in storing as it's a direct replacement in the tree so no
> > allocations necessary. Thus this is safe even in -ENOMEM scenarios.
> >
> > Clearing out the stale data means we may actually need to allocate to
> > remove vmas from the new tree, because we use allocated memory in the
> > maple tree - we'll need to rebalance, new parents, etc, etc.
> >
...
> >
> > I could make a function that frees all new vmas and destroys the tree
> > specifically for this failure state?
>
> I think the problem is that some page tables were already copied, so we
> would have to zap them as well.
>
> Maybe just factoring stuff from the exit_mmap() function could be one way to
> do it.
Yes, this is much easier now that both are in the same .c file.
..
> >
> > This is funny because we already have a (probably) benign race with oom
> > here. This code may already visit the mm after __oom_reap_task_mm() and
> > the mm disappearing, but since the anon vmas should be removed,
> > unuse_mm() will skip them.
> >
> > Although, I'm not sure what happens when
> > mmu_notifier_invalidate_range_start_nonblock() fails AND unuse_mm() is
> > called on the mm after. Maybe checking the unstable mm is necessary
> > here anyways?
>
> Can we have MMU notifiers active while the process never even ran and we are
> only halfway through duplicating VMAs?
>
I doubt it. I was thinking in other cases where the MMF_UNSTABLE flag
was set but the oom code failed to free all anon vmas based on the MMU
notifier. That is, does this code have an existing race that's much
harder to hit?
Thanks,
Liam
Powered by blists - more mailing lists