Message-ID: <CAG48ez2FCsKNNaKa5Y0xBJTdtzptoCxM_+XNNg=bUMgoLDyC3Q@mail.gmail.com>
Date: Fri, 15 Aug 2025 21:49:19 +0200
From: Jann Horn <jannh@...gle.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>
Cc: David Hildenbrand <david@...hat.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
maple-tree@...ts.infradead.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Vlastimil Babka <vbabka@...e.cz>,
Mike Rapoport <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Andrew Morton <akpm@...ux-foundation.org>, Pedro Falcato <pfalcato@...e.de>,
Charan Teja Kalla <quic_charante@...cinc.com>, shikemeng@...weicloud.com, kasong@...cent.com,
nphamcs@...il.com, bhe@...hat.com, baohua@...nel.org, chrisl@...nel.org,
Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC PATCH 0/6] Remove XA_ZERO from error recovery of

On Fri, Aug 15, 2025 at 9:10 PM Liam R. Howlett <Liam.Howlett@...cle.com> wrote:
> Before you read on, please take a moment to acknowledge that David
> Hildenbrand asked for this, so I'm blaming mostly him :)
>
> It is possible that the dup_mmap() call fails on allocating or setting
> up a vma after the maple tree of the oldmm is copied. Today, that
> failure point is marked by inserting an XA_ZERO entry over the failure
> point so that the exact location does not need to be communicated
> through to exit_mmap().

Overall: Yes please, I'm in favor of getting rid of that XA_ZERO special case.

> However, a race exists in the tear down process because the dup_mmap()
> drops the mmap lock before exit_mmap() can remove the partially set up
> vma tree. This means that other tasks may get to the mm tree and find
> the invalid vma pointer (since it's an XA_ZERO entry), even though the
> mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
>
> To remove the race fully, the tree must be cleaned up before dropping
> the lock. This is accomplished by extracting the vma cleanup in
> exit_mmap() and changing the required functions to pass through the vma
> search limit.

It really seems to me like, instead of tearing down the whole tree on
this failure path, we should be able to remove those entries in the
cloned vma tree that haven't been copied yet and then proceed as
normal. I understand that this is complicated because of maple tree
weirdness; but can't we somehow fix the wr_rebalance case to not
allocate more memory when reducing the number of tree nodes?

Surely there's some way to do that. A really stupid suggestion: As
long as wr_rebalance is guaranteed to not increase the number of
nodes, we could make do with a global-mutex-protected system-global
preallocation of significantly less than 64 maple tree nodes, right?
We could even use that in RCU mode, as long as we are willing to take
a synchronize_rcu() penalty on this "we really want to wipe some tree
elements" slowpath.
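
A rough userspace sketch of that preallocation idea, for illustration only and not maple tree code. Everything named here is hypothetical: `struct node` stands in for a maple tree node, `RESERVE_NODES` is an assumed bound (the text above only says "significantly less than 64"), and a pthread mutex stands in for the global kernel mutex. The point it tries to show is that a rewrite which never produces more nodes than it consumes can always refill the reserve from the nodes it retires, so the slowpath never has to allocate.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define RESERVE_NODES 16	/* assumed bound, not taken from the kernel */

struct node {			/* hypothetical stand-in for a tree node */
	unsigned char payload[256];
};

static pthread_mutex_t reserve_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *reserve[RESERVE_NODES];
static size_t reserve_count;

/* Fill the system-global reserve once, while allocation may still fail. */
static int reserve_init(void)
{
	for (reserve_count = 0; reserve_count < RESERVE_NODES; reserve_count++) {
		reserve[reserve_count] = malloc(sizeof(struct node));
		if (!reserve[reserve_count])
			return -1;
	}
	return 0;
}

/*
 * Replace n_old existing nodes with n_new <= n_old nodes drawn from the
 * reserve. No allocation happens here, so the operation itself cannot
 * fail for lack of memory. Returns 0 on success, -1 if the request is
 * not a shrinking rewrite or does not fit in the reserve.
 */
static int shrink_with_reserve(struct node **old, size_t n_old,
			       struct node **fresh, size_t n_new)
{
	size_t i;

	if (n_new > n_old || n_new > RESERVE_NODES)
		return -1;

	pthread_mutex_lock(&reserve_lock);
	assert(reserve_count >= n_new);	/* reserve is full between calls */

	for (i = 0; i < n_new; i++)
		fresh[i] = reserve[--reserve_count];

	/* ... the tree would be rewritten to point at fresh[] here ... */

	/*
	 * In RCU mode the retired nodes may still have readers; the kernel
	 * equivalent would wait for a grace period (synchronize_rcu())
	 * before reusing them. This sketch has no concurrent readers.
	 */
	for (i = 0; i < n_old; i++) {
		if (reserve_count < RESERVE_NODES)
			reserve[reserve_count++] = old[i];	/* refill reserve */
		else
			free(old[i]);				/* surplus node */
	}

	pthread_mutex_unlock(&reserve_lock);
	return 0;
}
```

Holding the mutex across the whole operation is what lets a small, fixed reserve serve every caller: only one shrinking rewrite is in flight at a time, and it hands back at least as many nodes as it took before the next one starts.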

It feels like we're adding more and more weird contortions caused by
the kinda bizarre "you can't reliably remove tree elements" property
of maple trees, and I really feel like that complexity should be
pushed down into the maple tree implementation instead.