linux-kernel - Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <aEB8fFEXKPR54LdA@x1.local>
Date: Wed, 4 Jun 2025 13:03:56 -0400
From: Peter Xu <peterx@...hat.com>
To: Jann Horn <jannh@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	David Hildenbrand <david@...hat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	"Liam R. Howlett" <Liam.Howlett@...cle.com>,
	Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Michal Hocko <mhocko@...e.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory
 snapshot

On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
> It is not currently documented that the child of fork() should receive a
> coherent snapshot of the parent's memory, or how we get such a snapshot.
> Add a comment block to explain this.
> 
> Signed-off-by: Jann Horn <jannh@...gle.com>
> ---
>  kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 85afccfdf3b1..f78f5df596a9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
>  }
>  
>  #ifdef CONFIG_MMU
> +/*
> + * Anonymous memory inherited by the child MM must, on success, contain a
> + * coherent snapshot of corresponding anonymous memory in the parent MM.

Should we better define what is a coherent snapshot?  Or maybe avoid using
this term which seems to apply to the whole mm?

I think it's at least not a snapshot of whole mm at a specific time,
because as long as there can be more than one concurrent writers (hence, it
needs to be at least 3 threads in the parent process, 1 in charge of fork),
this can happen:

  parent writer 1      parent writer 2    parent fork thr
  ---------------      ---------------    ---------------
                                          wr-protect P1
  write P1                                                  <---- T1
  (trapped, didn't happen)
                       write PN                             <---- T2
                       (went through)
                                          ...
                                          wr-protect PN

The result of above would be that child process will see a mixture of old
P1 (at timestamp T1) but updated P2 (timestamp T2).  I don't think it's
impossible that the userapp could try to serialize "write P1" and "write
PN" operations in a way that it would also get a surprise seeing in the
child PN updated but P1 didn't.

I do agree it at least recovered the per-page coherence, though, no matter
what is the POSIX definition of that.  IIUC an userapp can always fix such
problem, but maybe it's too complicated in some cases, and if Linux used to
at least maintain per-page coherency, then it may make sense to recover the
behavior especially when it only affects pinned.

Said that, maybe we still want to be specific on the goal of the change.

Thanks,

> + * (An exception are anonymous memory regions which are concurrently written
> + * by kernel code or hardware devices through page references obtained via GUP.)
> + * We effectively snapshot the parent's memory just before
> + * mmap_write_unlock(oldmm); any writes after that point are invisible to the
> + * child, while attempted writes before that point are either visible to the
> + * child or delayed until after mmap_write_unlock(oldmm).
> + *
> + * To make that work while only needing a single pass through the parent's VMA
> + * tree and page tables, we follow these rules:
> + *
> + *  - Before mmap_write_unlock(), a TLB flush ensures that parent threads can't
> + *    write to copy-on-write pages anymore.
> + *  - Before dup_mmap() copies page contents (which happens rarely), the
> + *    parent's PTE for the page is made read-only and a TLB flush is issued, so
> + *    subsequent writes are delayed until mmap_write_unlock().
> + *  - Before dup_mmap() starts walking the page tables of a VMA in the parent,
> + *    the VMA is write-locked to ensure that the parent can't perform writes
> + *    that won't be visible in the child before mmap_write_unlock():
> + *      a) through concurrent copy-on-write handling
> + *      b) by upgrading read-only PTEs to writable
> + *
> + * Not following these rules, and giving the child a torn copy of the parent's
> + * memory contents where different segments come from different points in time,
> + * would likely _mostly_ work:
> + * Any memory to which a concurrent parent thread could be writing under a lock
> + * can't be accessed from the child without risking deadlocks (since the child
> + * might inherit the lock in a locked state, in which case the lock will stay
> + * locked forever in the child).
> + * But if userspace is using trylock or lock-free algorithms, providing a torn
> + * view of memory could break the child.
> + */
>  static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  					struct mm_struct *oldmm)
>  {
> 
> -- 
> 2.49.0.1204.g71687c7c1d-goog
> 

-- 
Peter Xu