Message-ID: <3iptynmztv3kdxkf3gaz6cgee2v4j3lkd7aj2cgftpgcmyvgyp@sdemowhht6zl>
Date: Mon, 18 Aug 2025 11:48:55 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Jann Horn <jannh@...gle.com>
Cc: David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
maple-tree@...ts.infradead.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Vlastimil Babka <vbabka@...e.cz>,
Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Pedro Falcato <pfalcato@...e.de>,
Charan Teja Kalla <quic_charante@...cinc.com>,
shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com,
bhe@...hat.com, baohua@...nel.org, chrisl@...nel.org,
Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC PATCH 0/6] Remove XA_ZERO from error recovery of

* Jann Horn <jannh@...gle.com> [250815 15:50]:
> On Fri, Aug 15, 2025 at 9:10 PM Liam R. Howlett <Liam.Howlett@...cle.com> wrote:
> > Before you read on, please take a moment to acknowledge that David
> > Hildenbrand asked for this, so I'm blaming mostly him :)
> >
> > It is possible that the dup_mmap() call fails on allocating or setting
> > up a vma after the maple tree of the oldmm is copied. Today, that
> > failure point is marked by inserting an XA_ZERO entry over the failure
> > point so that the exact location does not need to be communicated
> > through to exit_mmap().
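
[A rough sketch, not code from this series, of what a racing walker
could hit while the marker is visible; xa_is_zero() is the real
predicate, the rest is simplified:

```c
/* Illustrative only.  A reader that walks the copied tree while the
 * marker is still present sees the XA_ZERO sentinel where a vma
 * pointer is expected: */
struct vm_area_struct *vma = mas_find(&mas, ULONG_MAX);

if (xa_is_zero(vma)) {
	/* Not a real vma - dereferencing this entry is the bug the
	 * race exposes, hence cleaning up before dropping the lock. */
}
```
]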
>
> Overall: Yes please, I'm in favor of getting rid of that XA_ZERO special case.
>
> > However, a race exists in the tear down process because the dup_mmap()
> > drops the mmap lock before exit_mmap() can remove the partially set up
> > vma tree. This means that other tasks may get to the mm tree and find
> > the invalid vma pointer (since it's an XA_ZERO entry), even though the
> > mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> >
> > To remove the race fully, the tree must be cleaned up before dropping
> > the lock. This is accomplished by extracting the vma cleanup in
> > exit_mmap() and changing the required functions to pass through the vma
> > search limit.
>
> It really seems to me like, instead of tearing down the whole tree on
> this failure path, we should be able to remove those entries in the
> cloned vma tree that haven't been copied yet and then proceed as
> normal. I understand that this is complicated because of maple tree
> weirdness; but can't we somehow fix the wr_rebalance case to not
> allocate more memory when reducing the number of tree nodes?
> Surely there's some way to do that. A really stupid suggestion: As
> long as wr_rebalance is guaranteed to not increase the number of
> nodes, we could make do with a global-mutex-protected system-global
> preallocation of significantly less than 64 maple tree nodes, right?
> We could even use that in RCU mode, as long as we are willing to take
> a synchronize_rcu() penalty on this "we really want to wipe some tree
> elements" slowpath.
>
> It feels like we're adding more and more weird contortions caused by
> the kinda bizarre "you can't reliably remove tree elements" property
> of maple trees, and I really feel like that complexity should be
> pushed down into the maple tree implementation instead.

First, thanks for looking at this. I appreciate you taking the time to
think through what is going on here and the alternatives.

The maple tree isn't the first place where we need memory in order to
free memory, and I doubt it will be the last. I do have a method of
reusing nodes, but it only works when RCU mode is off. What you are
suggesting is very complicated, will have a number of corner cases, and
would probably have zero users.

Having a global reserve of nodes has come up before, but there is no
way to guarantee that the reserve would be large enough to keep things
from failing. 64 is a huge number of nodes, but there is no way to know
whether we'll hit this situation several times in rapid succession.
And since we cannot depend on it working, there is no real benefit to
having the mode - we have to plan for the rare case of failure
regardless.

There is also another project to reduce the possibility of the maple
tree itself failing to allocate [1], but in the dup_mmap() situation the
tree is not the problem. What you are asking me to do is to add a
special mode to the tree to work around the fact that the mm struct has
a life cycle problem.

If this special mode already existed as you suggest, I would advise
against using it here: we are out of memory, we want to get memory back
from the failed dup_mmap() as quickly as possible, and tearing the tree
down is the fastest way to do that. I don't want code executing to
juggle things around to fix an mm struct just so the unlock/lock dance
is safe against random race conditions.

This is why I think the mode would have zero users: dup_mmap()
shouldn't be doing any extra work after a failed allocation.

I see where you are coming from, and having compartmentalised code
feels like the right path, but in this case the system failed to
allocate enough memory even after reclaim ran, and we would be adding
extra run time for zero benefit. From the maple tree perspective there
is one call to destroy the tree; the rest of the code cleans up the
failed mm setup.

On a higher level, I'm disappointed that the mm struct is
half-initialized and unlocked at all. Really, I think we should go
further and not have the mm exposed for any window of time. The issue
with just calling exit_mmap() directly is the unlock/lock dance that
exists for performance reasons.

Equally, the !vma issue seems crazy because it's about avoiding races
where the task has unmapped everything, and we really shouldn't be able
to iterate over tasks that are in this state.

And we have this MMF_UNSTABLE bit in the mm, but other tasks walk into
the mm and access the struct without checking it. That seems
suboptimal, at best.

[1] https://lore.kernel.org/all/20250723-slub-percpu-caches-v5-0-b792cd830f5d@suse.cz/

Thanks,
Liam