Message-Id: <20250603-fork-tearing-v1-2-a7f64b7cfc96@google.com>
Date: Tue, 03 Jun 2025 20:21:03 +0200
From: Jann Horn <jannh@...gle.com>
To: Andrew Morton <akpm@...ux-foundation.org>, 
 David Hildenbrand <david@...hat.com>, 
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
 "Liam R. Howlett" <Liam.Howlett@...cle.com>, 
 Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>, 
 Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>, 
 linux-mm@...ck.org
Cc: Peter Xu <peterx@...hat.com>, linux-kernel@...r.kernel.org, 
 Jann Horn <jannh@...gle.com>
Subject: [PATCH 2/2] mm/memory: Document how we make a coherent memory
 snapshot

It is not currently documented that the child of fork() should receive a
coherent snapshot of the parent's memory, or how we get such a snapshot.
Add a comment block to explain this.

Signed-off-by: Jann Horn <jannh@...gle.com>
---
 kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 85afccfdf3b1..f78f5df596a9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
 }
 
 #ifdef CONFIG_MMU
+/*
+ * Anonymous memory inherited by the child MM must, on success, contain a
+ * coherent snapshot of corresponding anonymous memory in the parent MM.
+ * (Exceptions are anonymous memory regions which are concurrently written
+ * by kernel code or hardware devices through page references obtained via GUP.)
+ * We effectively snapshot the parent's memory just before
+ * mmap_write_unlock(oldmm); any writes after that point are invisible to the
+ * child, while attempted writes before that point are either visible to the
+ * child or delayed until after mmap_write_unlock(oldmm).
+ *
+ * To make that work while only needing a single pass through the parent's VMA
+ * tree and page tables, we follow these rules:
+ *
+ *  - Before mmap_write_unlock(), a TLB flush ensures that parent threads can't
+ *    write to copy-on-write pages anymore.
+ *  - Before dup_mmap() copies page contents (which happens rarely), the
+ *    parent's PTE for the page is made read-only and a TLB flush is issued, so
+ *    subsequent writes are delayed until mmap_write_unlock().
+ *  - Before dup_mmap() starts walking the page tables of a VMA in the parent,
+ *    the VMA is write-locked to ensure that the parent can't perform writes
+ *    that won't be visible in the child before mmap_write_unlock():
+ *      a) through concurrent copy-on-write handling
+ *      b) by upgrading read-only PTEs to writable
+ *
+ * Not following these rules, and giving the child a torn copy of the parent's
+ * memory contents where different segments come from different points in time,
+ * would likely _mostly_ work:
+ * Any memory to which a concurrent parent thread could be writing under a lock
+ * can't be accessed from the child without risking deadlocks (since the child
+ * might inherit the lock in a locked state, in which case the lock will stay
+ * locked forever in the child).
+ * But if userspace is using trylock or lock-free algorithms, providing a torn
+ * view of memory could break the child.
+ */
 static __latent_entropy int dup_mmap(struct mm_struct *mm,
 					struct mm_struct *oldmm)
 {

-- 
2.49.0.1204.g71687c7c1d-goog
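
For illustration only, here is a minimal userspace sketch of the kind of
lock-free invariant that the comment block above says a torn snapshot could
break. It is not part of the patch and does not exercise any specific kernel
path. The names (shared_state, writer_thread, count_torn_snapshots) are
hypothetical, and the two-counter invariant is an assumption chosen for the
example. A writer thread bumps x and then y, which live on different pages of
private anonymous memory, so at any single point in time x == y or x == y + 1.
Each forked child checks that relation on its inherited copy.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical example state: two counters placed on different pages. */
struct shared_state {
	atomic_ulong *x;	/* lives on the first page */
	atomic_ulong *y;	/* lives on the second page */
	atomic_bool stop;
};

/* Writer: increment x, then y, so that x == y or x == y + 1 at all times. */
static void *writer_thread(void *arg)
{
	struct shared_state *s = arg;

	while (!atomic_load(&s->stop)) {
		atomic_fetch_add(s->x, 1);
		atomic_fetch_add(s->y, 1);
	}
	return NULL;
}

/*
 * Fork `rounds` times while the writer runs. Returns the number of children
 * that saw the invariant violated, or -1 on setup failure.
 */
int count_torn_snapshots(int rounds)
{
	long page = sysconf(_SC_PAGESIZE);
	struct shared_state s = { 0 };
	pthread_t writer;
	int torn = 0;

	assert(rounds >= 0);
	if (page <= 0)
		return -1;

	/* Private anonymous memory: this is what fork() must snapshot. */
	char *mem = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	s.x = (atomic_ulong *)mem;
	s.y = (atomic_ulong *)(mem + page);

	if (pthread_create(&writer, NULL, writer_thread, &s) != 0) {
		munmap(mem, 2 * (size_t)page);
		return -1;
	}

	for (int i = 0; i < rounds; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			torn = -1;
			break;
		}
		if (pid == 0) {
			/* The writer thread does not exist in the child. */
			unsigned long x = atomic_load(s.x);
			unsigned long y = atomic_load(s.y);

			_exit((y <= x && x - y <= 1) ? 0 : 1);
		}

		int status;

		if (waitpid(pid, &status, 0) != pid) {
			torn = -1;
			break;
		}
		if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
			torn++;
	}

	atomic_store(&s.stop, true);
	pthread_join(writer, NULL);
	munmap(mem, 2 * (size_t)page);
	return torn;
}
```

A caller would invoke count_torn_snapshots() from main() and treat any
positive return value as a child having observed pages from different points
in time. Under the snapshot guarantee described in the comment block, the
expected result is 0. A passing run does not prove the guarantee, since the
window for tearing is small.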

