linux-kernel - Re: vfork(2) behavior not consistent with fork(2) (was: vfork(2) fails after unshare(CLONE_NEWTIME) (was: [Bug 215769] man 2 vfork() does not document corner case when PID == 1))

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220406084613.3srklyt27qxcmrcx@wittgenstein>
Date:   Wed, 6 Apr 2022 10:46:13 +0200
From:   Christian Brauner <brauner@...nel.org>
To:     Alejandro Colomar <alx.manpages@...il.com>
Cc:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Коренберг Марк 
        <socketpair@...il.com>, Andrei Vagin <avagin@...nvz.org>,
        Dmitry Safonov <dima@...sta.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Arnd Bergmann <arnd@...db.de>, Serge Hallyn <serge@...lyn.com>,
        bugzilla-daemon@...nel.org
Subject: Re: vfork(2) behavior not consistent with fork(2) (was: vfork(2)
 fails after unshare(CLONE_NEWTIME) (was: [Bug 215769] man 2 vfork() does not
 document corner case when PID == 1))

On Tue, Apr 05, 2022 at 09:28:12PM +0200, Alejandro Colomar wrote:
> Hey, Christian!
> 
> On 4/4/22 10:05, Christian Brauner wrote:
> > On Sat, Apr 02, 2022 at 11:15:52PM +0200, Alejandro Colomar (man-pages) wrote:
> > > [Added some kernel CCs that may know what's going on]
> [...]
> > > Maybe someone in the kernel can send some patch for the clone(2) and/or
> > > vfork(2) manual pages that explains the reason (if it's intended).
> > 
> > Hey Alejandro,
> > 
> > I won't be able to send a patch very soon but I can at least explain why
> > you see EINVAL. :)
> 
> Don't hurry, we're not planning to release any soon :)
> 
> > 
> > This is intended.
> > 
> > vfork() suspends the parent process and the child process will share the
> > same vm as the parent process. If the child process is in a new time
> > namespace different from its parent process it is not allowed to be in
> > the same threadgroup or share virtual memory with the parent process.
> > That's why you see EINVAL.
> 
> That makes a lot of sense to me.
> 
> > 
> > Note, the unshare(CLONE_NEWTIME) call will _not_ cause the calling
> > process to be moved into a different time namespace. Only the newly
> > created child process will be after a subsequent
> > fork()/vfork()/clone()/clone3()...
> > 
> > The semantics are equivalent to that of CLONE_NEWPID in this regard. You
> > can see this via /proc/<pid>/ns/ where you see two entries for pid
> > namespaces and also two entries for time namespaces:
> > 
> > * CLONE_NEWTIME
> >    * /proc/<pid>/ns/time			// current time namespace
> >    * /proc/<pid>/ns/time_for_children	// time namespace for the new child process
> 
> Also makes sense.  Michael taught me that a few weeks ago :)
> 
> This also triggers some doubt:  will the same problem happen with
> CLONE_NEWPID since it also moves the child into a new ns (in this case a PID
> one)?  See test program below.

No, it won't. A pid namespace places no relevant constraints on vm usage
whereas a time namespace does.
If a task joins a new time namespace it'll clean the VVAR page tables
and refault them with the new layout after the timens change. That
affects all tasks which use the same task->mm.

Since CLONE_THREAD implies CLONE_VM this would affect the whole
thread-group behind their back. All threads would suddenly change
timens.

No such issues exist for pid namespaces; they don't need to alter
task->mm.

> 
> > 
> > If during fork:
> > 
> > parent_process->time != parent_process->time_for_children
> > 
> > and either CLONE_VM or CLONE_THREAD is set you see EINVAL.
> > 
> > You can thus replicate the same error via:
> > 
> > unshare(CLONE_NEWTIME)
> > 
> > and a
> > 
> > clone() or clone3() call with CLONE_VM or CLONE_THREAD.
> 
> So, to test my doubts, I wrote this similar program (and also similar
> programs where only the CLONE_NEW* flag was changed, one with CLONE_NEWTIME,
> and one with CLONE_NEWNS)):
> 
> $ cat vfork_newpid.c
> #define _GNU_SOURCE
> #include <err.h>
> #include <errno.h>
> #include <linux/sched.h>
> #include <sched.h>
> #include <signal.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> 
> static char *const child_argv[] = {
> 	"print_pid",
> 	NULL
> };
> 
> static char *const child_envp[] = {
> 	NULL
> };
> 
> int
> main(void)
> {
> 	pid_t pid;
> 
> 	printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid());
> 
> 	if (unshare(CLONE_NEWPID) == -1)
> 		err(EXIT_FAILURE, "unshare(2)");
> 	if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
> 		err(EXIT_FAILURE, "signal(2)");
> 
> 	pid = syscall(SYS_vfork);
> 	//pid = vfork();  // This behaves differently.
> 	switch (pid) {
> 	case 0:
> 		execve("/home/alx/tmp/print_pid", child_argv, child_envp);
> 		err(EXIT_SUCCESS, "PID %jd exiting after execve(2)",
> 		    (long) getpid());
> 	case -1:
> 		err(EXIT_FAILURE, "vfork(2)");
> 	default:
> 		errx(EXIT_SUCCESS, "Parent exiting after vfork(2).");
> 	}
> }
> 
> $ cat print_pid.c
> #include <err.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> int
> main(void)
> {
> 	errx(EXIT_SUCCESS, "PID %jd exiting.", (long) getpid());
> }
> 
> $ cc -Wall -Wextra -Werror -o print_pid print_pid.c
> $ cc -Wall -Wextra -Werror -o vfork_newpid vfork_newpid.c
> $
> $
> $ sudo ./vfork_newpid
> vfork_newpid: PID: 8479
> vfork_newpid: PID 8479 exiting after execve(2): Success
> print_pid: PID 1 exiting.
> $
> $
> $ sudo ./vfork_newtime
> vfork_newtime: PID: 8484
> vfork_newtime: vfork(2): Invalid argument
> $
> $
> $ sudo ./vfork_newns
> vfork_newns: PID: 8486
> vfork_newns: PID 8486 exiting after execve(2): Success
> print_pid: PID 8487 exiting.
> 
> 
> The first thing I noted is that usage of vfork(2) differs considerably from
> fork(2), and that's something that's not clear by reading the manual page.
> It sais that the parent process is suspended until the child calls
> execve(2), but I expected it to mean that vfork(2) doesn't return to the
> parent until that happened, but was otherwise transparent.  I was wrong and
> my tests showed me that.
> 
> I was going to propose an example program for the manual page, when I
> decided to try a slightly different thing: call vfork() instead of
> syscall(SYS_vfork);  that changed the behavior to the same one as with
> fork(2) (i.e., the parent resumes after vfork(2) returns the PID of the
> child.
> 
> Is that also intended?  I couldn't find the glibc wrapper source code, so I
> don't know what is glibc doing here, but I straced the processes, and
> they're all calling vfork(), so the behavior should be consistent; it's
> quite weird.  I'm very confused at this point.

glibc does vfork() via inline assembly massaging. There's probably
atfork handlers and a bunch of other stuff involved so it's difficult to
do a remote diagnosis.
(And note that calling anything other than execve() or _exit() after
vfork() is basically undefined behavior.)

> 
> 
> I'm also wondering why it's okay to have processes in different PID ns share
> the same vm, but I guess that's implementation details that I don't need to
> care that much.

See earlier in the thread.