[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9923da5f8b095dd1e8d677692dcaf95859de0ef5.1768746221.git.lorenzo.stoakes@oracle.com>
Date: Sun, 18 Jan 2026 14:50:38 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Shakeel Butt <shakeel.butt@...ux.dev>,
David Hildenbrand <david@...nel.org>, Rik van Riel <riel@...riel.com>,
Harry Yoo <harry.yoo@...cle.com>, Jann Horn <jannh@...gle.com>,
Mike Rapoport <rppt@...nel.org>, Michal Hocko <mhocko@...e.com>,
Pedro Falcato <pfalcato@...e.de>, Chris Li <chriscli@...gle.com>,
Barry Song <v-songbaohua@...o.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: [PATCH v3 2/9] mm/rmap: eliminate partial anon_vma tear-down in anon_vma_fork()
We have spun a web of unnecessary headaches for ourselves in
anon_vma_fork() with the introduction of the anon_vma reuse logic, as
introduced by commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma
hierarchy").
When we clone anon_vma's linked to a VMA via vma->anon_vma_chain, we check
each anon_vma for specific conditions, and if met we set vma->anon_vma to
this anon_vma to indicate we will reuse it rather than allocating a new
one.
It triggers upon the first ancestor anon_vma found that possesses at most 1
child, and no active VMAs.
This was implemented such that if you continually fork and free VMAs, you
would achieve anon_vma reuse rather than continually allocating unnecessary
new anon_vma's.
This however brings an unfortunate situation should a memory allocation
fail during this process. anon_vma_fork():
1. Clones the anon_vma.
2. If no reuse (i.e. !vma->anon_vma), tries to allocate anon_vma, AVC.
3. If 2 fails, we are forced to unwind step 1 by invoking
unlink_anon_vmas(vma).
This means that we send a partially set up (i.e. invalid) VMA to
unlink_anon_vmas().
Doing this is dangerous and confusing - it is reasonable for kernel
developers to assume unlink_anon_vmas() is called on a correctly
established vma, and thus confusion arises if logic is implemented there to
account for invalid VMAs, and further development might introduce subtle
bugs.
It is especially problematic in the anon rmap implementation which is
essentially a broken abstraction.
The patch solves the issue by simply trying to allocate the anon_vma and
AVC ahead of time - i.e. optimising for the usual case - and freeing them
should reuse occur or an error arise in anon_vma_clone().
This is not egregious performance-wise, as this function is called on the
fork path which already performs a great number of allocations, and thus it
is already a slow-path in this respect.
It is additionally not egregious in terms of memory usage - the allocations
are too-small-to-fail anyway unless, for instance, a fatal signal may have
arisen, and any OOM for such tiny allocations that may arise would indicate
the system is under so much memory pressure that the associated process is
not long for this world anyway.
We also update anon_vma->num_active_vmas to 1 directly rather than
incrementing the newly allocated anon_vma's active VMA count - this makes
it clear that this detached anon_vma can have only 1 num_active_vma at this
point.
Finally we eliminate the out_error and out_error_free_anon_vma labels which
makes the logic much easier to follow. We also correct a small comment
typo.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
---
mm/rmap.c | 46 ++++++++++++++++++++++++----------------------
1 file changed, 24 insertions(+), 22 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index cdb7618c10b1..a45b011e9846 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -360,7 +360,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
{
struct anon_vma_chain *avc;
struct anon_vma *anon_vma;
- int error;
+ int rc;
/* Don't bother if the parent process has no anon_vma here. */
if (!pvma->anon_vma)
@@ -369,27 +369,35 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
/* Drop inherited anon_vma, we'll reuse existing or allocate new. */
vma->anon_vma = NULL;
+ anon_vma = anon_vma_alloc();
+ if (!anon_vma)
+ return -ENOMEM;
+ avc = anon_vma_chain_alloc(GFP_KERNEL);
+ if (!avc) {
+ put_anon_vma(anon_vma);
+ return -ENOMEM;
+ }
+
/*
* First, attach the new VMA to the parent VMA's anon_vmas,
* so rmap can find non-COWed pages in child processes.
*/
- error = anon_vma_clone(vma, pvma);
- if (error)
- return error;
-
- /* An existing anon_vma has been reused, all done then. */
- if (vma->anon_vma)
- return 0;
+ rc = anon_vma_clone(vma, pvma);
+ /* An error arose or an existing anon_vma was reused, all done then. */
+ if (rc || vma->anon_vma) {
+ put_anon_vma(anon_vma);
+ anon_vma_chain_free(avc);
+ return rc;
+ }
- /* Then add our own anon_vma. */
- anon_vma = anon_vma_alloc();
- if (!anon_vma)
- goto out_error;
- anon_vma->num_active_vmas++;
- avc = anon_vma_chain_alloc(GFP_KERNEL);
- if (!avc)
- goto out_error_free_anon_vma;
+ /*
+ * OK no reuse, so add our own anon_vma.
+ *
+ * Since it is not linked anywhere we can safely manipulate anon_vma
+ * fields without a lock.
+ */
+ anon_vma->num_active_vmas = 1;
/*
* The root anon_vma's rwsem is the lock actually used when we
* lock any of the anon_vmas in this anon_vma tree.
@@ -410,12 +418,6 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
anon_vma_unlock_write(anon_vma);
return 0;
-
- out_error_free_anon_vma:
- put_anon_vma(anon_vma);
- out_error:
- unlink_anon_vmas(vma);
- return -ENOMEM;
}
/*
--
2.52.0
Powered by blists - more mailing lists