lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aECoHDM3l2dKTfDw@x1.local>
Date: Wed, 4 Jun 2025 16:10:04 -0400
From: Peter Xu <peterx@...hat.com>
To: Jann Horn <jannh@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	David Hildenbrand <david@...hat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	"Liam R. Howlett" <Liam.Howlett@...cle.com>,
	Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Michal Hocko <mhocko@...e.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory
 snapshot

On Wed, Jun 04, 2025 at 08:11:08PM +0200, Jann Horn wrote:
> On Wed, Jun 4, 2025 at 7:04 PM Peter Xu <peterx@...hat.com> wrote:
> > On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
> > > It is not currently documented that the child of fork() should receive a
> > > coherent snapshot of the parent's memory, or how we get such a snapshot.
> > > Add a comment block to explain this.
> > >
> > > Signed-off-by: Jann Horn <jannh@...gle.com>
> > > ---
> > >  kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 85afccfdf3b1..f78f5df596a9 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> > >  }
> > >
> > >  #ifdef CONFIG_MMU
> > > +/*
> > > + * Anonymous memory inherited by the child MM must, on success, contain a
> > > + * coherent snapshot of corresponding anonymous memory in the parent MM.
> >
> > Should we better define what is a coherent snapshot?  Or maybe avoid using
> > this term which seems to apply to the whole mm?
> >
> > I think it's at least not a snapshot of whole mm at a specific time,
> > because as long as there can be more than one concurrent writers (hence, it
> > needs to be at least 3 threads in the parent process, 1 in charge of fork),
> > this can happen:
> >
> >   parent writer 1      parent writer 2    parent fork thr
> >   ---------------      ---------------    ---------------
> >                                           wr-protect P1
> >   write P1                                                  <---- T1
> >   (trapped, didn't happen)
> >                        write PN                             <---- T2
> >                        (went through)
> >                                           ...
> >                                           wr-protect PN
> >
> > The result of above would be that child process will see a mixture of old
> > P1 (at timestamp T1) but updated P2 (timestamp T2).  I don't think it's
> > impossible that the userapp could try to serialize "write P1" and "write
> > PN" operations in a way that it would also get a surprise seeing in the
> > child PN updated but P1 didn't.
> 
> If the write at T1 hits a page fault, then it doesn't actually happen
> at T1. The write instruction starts doing something at T1, but it does
> not fully retire, and the architectural register state does not
> change, and in particular the instruction pointer does not advance
> past this instruction; just like when speculative execution is aborted
> after a branch misprediction, except that the CPU raises an exception
> and we enter the page fault handler. The write actually happens when
> the instruction is executed a second time after page fault handling
> has completed after the mmap lock is dropped. (Unless something during
> page fault handling raises a signal, in which case the instruction
> might never architecturally execute.)

Fair enough.  So maybe that's something like a best-effort whole mm
snapshot anytime happened during the fork() but before releasing mmap write
lock.

Your comment did mention one exception on the kernel, is it still pretty
easy to happen?  I'm thinking this use case of trying to load some data
from a O_DIRECT fd and then set the var to show it's loaded:

  bool data_read=0
  read(...);
  data_read=1;

Then IIUC this can happen:

    parent thread 1                        parent fork thr
    ---------------                        ---------------
    read(...)
      using O_DIRECT on priv-anon buffers P1
      pin_user_pages
                                           fork() happens
                                             Sees P1 pinned
                                             P1 early COW (child sees no data loaded)
      memcpy()
    set data_read=1
    (data_read can be a global private var on P2)
                                             P2 wr-protected (child sees data_read=1)

Hence in child even if it sees data_read=1 it is possible the buffer may be
uninitialized, or the buffer is partly loaded, still racing with the kernel
early COW.

I'm not sure if I understand it correct this time as you discussed in the
comment. If so, should we still not emphasize too much on the kernel
providing coherent mm snapshot, at least emphasize the best-effort part
(both in comment of patch 2, but also in patch subjects)?  After all, it
seems it isn't straightforward for any userapp to see when that coherency
will be violated.

>From that POV, maybe it's better we should still suggest the undefined
behavior, even if it'll recover the old behavior some existing use case?

Thanks,

-- 
Peter Xu


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ