lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <efe3aogdw5wxsn46xyy2rrqui7oghyi7elam7aiv3c6o6hsfbx@ee6dayztcy2x>
Date: Mon, 11 Aug 2025 11:17:51 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: David Hildenbrand <david@...hat.com>,
        Charan Teja Kalla <quic_charante@...cinc.com>,
        akpm@...ux-foundation.org, shikemeng@...weicloud.com,
        kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
        baohua@...nel.org, chrisl@...nel.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH] mm: swap: check for xa_zero_entry() on vma in swapoff
 path

* Lorenzo Stoakes <lorenzo.stoakes@...cle.com> [250811 09:22]:
> On Mon, Aug 11, 2025 at 03:19:53PM +0200, David Hildenbrand wrote:
> > > > >
> > > > > > When registering vmas for uprobe, skip the vmas in an mm that is marked
> > > > > > unstable.  Modifying a vma in an unstable mm may cause issues if the mm
> > > > > > isn't fully initialised.__
> > > > > >
> > > > > > > Is there anything preventing us from just leaving a proper tree that
> > > > > > > reflects reality in place before we drop the write lock?
> > > > > >
> > > > > > When you mean proper tree, is this about the your previous question? --
> > > > > > Shouldn't we just remove anything from the tree here that was not copied
> > > > > > immediately?
> > > > >
> > > > > Commit d24062914837 (" fork: use __mt_dup() to duplicate maple tree in
> > > > > dup_mmap()") did this for efficiency, so it'd be a regression to do this.
> > > >
> > > > We're talking about the case where fork *fails*. That cannot possibly be
> > > > relevant for performance, can it? :)
> > >
> > > I think it optimises the overall operation, but as a product of that, has to
> > > handle this edge case, and that necessitated this rather horrble stuff.
> > >
> > > Obviously we don't need to optimise a 'we are about to die' case :)
> >
> > Right, so my original question was whether we could just drop all stale
> > stuff from the tree before we lift the mmap write lock, leaving only the
> > VMAs in there that we actually processed successfully.
> 
> That'd be better answered by Liam who's more familiar with it.

The short answer is no, we cannot.

But some options and questions below..

> 
> I think it may actually be difficult to do on some level or there was some
> reason we couldn't, but I may be mistaken.

Down the rabbit hole we go..

The cloning of the tree happens by copying the tree in DFS and
replacing the old nodes with new nodes.  The tree leaves end up being
copied, which contains all the vmas (unless DONT_COPY is set, so
basically always all of them..).  When the tree is copied, we have a
duplicate of the tree with pointers to all the vmas in the old process.

The way the tree fails is that we've been unable to finish cloning it,
usually for out of memory reasons.  So, this means we have a tree with
new and exciting vmas that have never been used and old but still active
vmas in oldmm.

The failure point is then marked with an XA_ZERO_ENTRY, which will
succeed in storing as it's a direct replacement in the tree so no
allocations necessary.  Thus this is safe even in -ENOMEM scenarios.

Clearing out the stale data means we may actually need to allocate to
remove vmas from the new tree, because we use allocated memory in the
maple tree - we'll need to rebalance, new parents, etc, etc.

So, to remove the stale data - we may actually have to allocate memory.

But we're most likely out of memory.. and we don't want to get the
shrinker involved in a broken task teardown, especially since it has
already been run and failed to help..

We could replace all the old vmas with XA_ZERO_ENTRY, but that doesn't
really fix this issue either.

I could make a function that frees all new vmas and destroys the tree
specifically for this failure state?

I'm almost certain this will lead to another whack-a-mole situation, but
those _should_ already be checked or written to work when there are zero
vmas in an mm (going by experience of what the scheduler does with an
empty tree).  Syzbot finds these scenarios sometimes via signals or
other corner cases that can happen..

Then again, I also thought the unstable mm should be checked where
necessary to avoid assumptions on the mm state..?

This is funny because we already have a (probably) benign race with oom
here.  This code may already visit the mm after __oom_reap_task_mm() and
the mm disappearing, but since the anon vmas should be removed,
unuse_mm() will skip them.

Although, I'm not sure what happens when
mmu_notifier_invalidate_range_start_nonblock() fails AND unuse_mm() is
called on the mm after.  Maybe checking the unstable mm is necessary
here anyways?

Thanks,
Liam


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ