linux-kernel - Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <t5uqs6kbzmcl2sjplxa5tqy6luinuysi7lfimbademagop7323@gveunpi3eqyo>
Date: Tue, 3 Jun 2025 21:32:25 +0100
From: Pedro Falcato <pfalcato@...e.de>
To: Jann Horn <jannh@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, 
	David Hildenbrand <david@...hat.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	"Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka <vbabka@...e.cz>, 
	Mike Rapoport <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>, 
	Michal Hocko <mhocko@...e.com>, linux-mm@...ck.org, Peter Xu <peterx@...hat.com>, 
	linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory
 snapshot

On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
> When fork() encounters possibly-pinned pages, those pages are immediately
> copied instead of just marking PTEs to make CoW happen later. If the parent
> is multithreaded, this can cause the child to see memory contents that are
> inconsistent in multiple ways:
> 
> 1. We are copying the contents of a page with a memcpy() while userspace
>    may be writing to it. This can cause the resulting data in the child to
>    be inconsistent.

This is an interesting problem, but we'll get to it later.

> 2. After we've copied this page, future writes to other pages may
>    continue to be visible to the child while future writes to this page are
>    no longer visible to the child.
>

Yes, and this is not fixable. It's also a problem for the regular write-protect
pte path where inevitably only a part of the address space will be write-protected.
This would only be fixable if e.g we suspended every thread on a multi-threaded fork.

> This means the child could theoretically see incoherent states where
> allocator freelists point to objects that are actually in use or stuff like
> that. A mitigating factor is that, unless userspace already has a deadlock
> bug, userspace can pretty much only observe such issues when fancy lockless
> data structures are used (because if another thread was in the middle of
> mutating data during fork() and the post-fork child tried to take the mutex
> protecting that data, it might wait forever).
> 

Ok, so the issue here is that atomics + memcpy (or our kernel variants) will
possibly observe tearing. This is indeed a problem, and POSIX doesn't _really_
tell us anything about this. _However_:

POSIX says:
> Any locks held by any thread in the calling process that have been set to be process-shared
> shall not be held by the child process. For locks held by any thread in the calling process
> that have not been set to be process-shared, any attempt by the child process to perform
> any operation on the lock results in undefined behavior (regardless of whether the calling
> process is single-threaded or multi-threaded).

The interesting bit here is "For locks held by any thread [...] any attempt by
the child [...] results in UB". I don't think it's entirely far-fetched to say
the spirit of the law is that atomics may also be UB (just like a lock[1] that was
held by a separate thread, then unlocked mid-concurrent-fork is in a UB state).

In any way, I think the bottom-line is that fork memory snapshot coherency is
a fallacy. It's really impossible to reach without adding insane constraints
(like the aforementioned thread suspending + resume). It's not even possible
when going through normal write-protect paths that have been conceptually stable since
the BSDs in the 1980s (due to the write-protect-a-page-at-a-time-problem).

Thus, personally I don't think this is worth fixing.

[1] This (at least in theory) covers every lock, so it also encompasses pthread spinlocks

-- 
Pedro