lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAG48ez29awjpSXnupQGyxCLoLds72QcYtbhmkAyLT2dCqFzA5Q@mail.gmail.com>
Date: Wed, 4 Jun 2025 17:41:47 +0200
From: Jann Horn <jannh@...gle.com>
To: Pedro Falcato <pfalcato@...e.de>
Cc: Andrew Morton <akpm@...ux-foundation.org>, David Hildenbrand <david@...hat.com>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, "Liam R. Howlett" <Liam.Howlett@...cle.com>, 
	Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>, 
	Michal Hocko <mhocko@...e.com>, linux-mm@...ck.org, Peter Xu <peterx@...hat.com>, 
	linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot

On Tue, Jun 3, 2025 at 10:32 PM Pedro Falcato <pfalcato@...e.de> wrote:
> On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
> > When fork() encounters possibly-pinned pages, those pages are immediately
> > copied instead of just marking PTEs to make CoW happen later. If the parent
> > is multithreaded, this can cause the child to see memory contents that are
> > inconsistent in multiple ways:
> >
> > 1. We are copying the contents of a page with a memcpy() while userspace
> >    may be writing to it. This can cause the resulting data in the child to
> >    be inconsistent.
>
> This is an interesting problem, but we'll get to it later.
>
> > 2. After we've copied this page, future writes to other pages may
> >    continue to be visible to the child while future writes to this page are
> >    no longer visible to the child.
> >
>
> Yes, and this is not fixable. It's also a problem for the regular write-protect
> pte path where inevitably only a part of the address space will be write-protected.

I don't understand what you mean by "inevitably only a part of the
address space will be write-protected". Are you talking about how
shared pages are kept shared between parent in child? Or are you
talking about how there is a point in time at which part of the
address space is write-protected while another part is not yet
write-protected? In that case: Yes, that can happen, but that's not a
problem.

> This would only be fixable if e.g we suspended every thread on a multi-threaded fork.

No, I think it is fine to keep threads running in parallel on a
multi-threaded fork as long as all the writes they do are guaranteed
to also be observable in the child. Such writes are no different from
writes performed before fork().

It would only get problematic if something in the parent first wrote
to page A, which has already been copied to the child (so the child no
longer sees the write) and then wrote to page B, which is CoWed (so
the child would see the write). I prevent this scenario by effectively
suspending the thread that tries to write to page A until the fork is
over (by making it block on the mmap lock in the fault handling path).

> > This means the child could theoretically see incoherent states where
> > allocator freelists point to objects that are actually in use or stuff like
> > that. A mitigating factor is that, unless userspace already has a deadlock
> > bug, userspace can pretty much only observe such issues when fancy lockless
> > data structures are used (because if another thread was in the middle of
> > mutating data during fork() and the post-fork child tried to take the mutex
> > protecting that data, it might wait forever).
> >
>
> Ok, so the issue here is that atomics + memcpy (or our kernel variants) will
> possibly observe tearing. This is indeed a problem, and POSIX doesn't _really_
> tell us anything about this. _However_:
>
> POSIX says:
> > Any locks held by any thread in the calling process that have been set to be process-shared
> > shall not be held by the child process. For locks held by any thread in the calling process
> > that have not been set to be process-shared, any attempt by the child process to perform
> > any operation on the lock results in undefined behavior (regardless of whether the calling
> > process is single-threaded or multi-threaded).
>
> The interesting bit here is "For locks held by any thread [...] any attempt by
> the child [...] results in UB". I don't think it's entirely far-fetched to say
> the spirit of the law is that atomics may also be UB (just like a lock[1] that was
> held by a separate thread, then unlocked mid-concurrent-fork is in a UB state).

I think interpreting atomic operations as locks is far-fetched. Also,
POSIX is a sort of minimal bar, and if we only implemented things
explicitly required by POSIX, we might not have a particularly useful
operating system.

Besides, I think things specified by the C standard override whatever
POSIX says, and C23 specifies that there are atomic operations, and I
haven't seen anything in C23 that restricts availability of those to
before fork().

> In any way, I think the bottom-line is that fork memory snapshot coherency is
> a fallacy. It's really impossible to reach without adding insane constraints
> (like the aforementioned thread suspending + resume). It's not even possible
> when going through normal write-protect paths that have been conceptually stable since
> the BSDs in the 1980s (due to the write-protect-a-page-at-a-time-problem).

No, Linux already had memory snapshot coherency before commit
70e806e4e645 ("mm: Do early cow for pinned pages during fork() for
ptes"). Write-protecting a page at a time does not cause coherency
issues, because letting a concurrent thread write into such memory
during fork() is no different from letting it do so before fork() from
a memory coherency perspective, as long as fork() write-locks memory
management for the process.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ