linux-kernel - vfork(2) behavior not consistent with fork(2) (was: vfork(2) fails after unshare(CLONE_NEWTIME) (was: [Bug 215769] man 2 vfork() does not document corner case when PID == 1))


Message-ID: <ae2cbf67-aace-bc40-418e-7b41873f814a@gmail.com>
Date:   Tue, 5 Apr 2022 21:28:12 +0200
From:   Alejandro Colomar <alx.manpages@...il.com>
To:     Christian Brauner <brauner@...nel.org>
Cc:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Коренберг Марк 
        <socketpair@...il.com>, Andrei Vagin <avagin@...nvz.org>,
        Dmitry Safonov <dima@...sta.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Arnd Bergmann <arnd@...db.de>, Serge Hallyn <serge@...lyn.com>,
        bugzilla-daemon@...nel.org
Subject: vfork(2) behavior not consistent with fork(2) (was: vfork(2) fails
 after unshare(CLONE_NEWTIME) (was: [Bug 215769] man 2 vfork() does not
 document corner case when PID == 1))

Hey, Christian!

On 4/4/22 10:05, Christian Brauner wrote:
> On Sat, Apr 02, 2022 at 11:15:52PM +0200, Alejandro Colomar (man-pages) wrote:
>> [Added some kernel CCs that may know what's going on]
[...]
>> Maybe someone in the kernel can send some patch for the clone(2) and/or
>> vfork(2) manual pages that explains the reason (if it's intended).
> 
> Hey Alejandro,
> 
> I won't be able to send a patch very soon but I can at least explain why
> you see EINVAL. :)

Don't hurry; we're not planning a release any time soon :)

> 
> This is intended.
> 
> vfork() suspends the parent process and the child process will share the
> same vm as the parent process. If the child process is in a new time
> namespace different from its parent process it is not allowed to be in
> the same threadgroup or share virtual memory with the parent process.
> That's why you see EINVAL.

That makes a lot of sense to me.

> 
> Note, the unshare(CLONE_NEWTIME) call will _not_ cause the calling
> process to be moved into a different time namespace. Only the newly
> created child process will be after a subsequent
> fork()/vfork()/clone()/clone3()...
> 
> The semantics are equivalent to that of CLONE_NEWPID in this regard. You
> can see this via /proc/<pid>/ns/ where you see two entries for pid
> namespaces and also two entries for time namespaces:
> 
> * CLONE_NEWTIME
>    * /proc/<pid>/ns/time			// current time namespace
>    * /proc/<pid>/ns/time_for_children	// time namespace for the new child process

Also makes sense.  Michael taught me that a few weeks ago :)

This also raises a question:  will the same problem happen with
CLONE_NEWPID, since it also moves the child into a new ns (in this case a
PID one)?  See the test program below.

> 
> If during fork:
> 
> parent_process->time != parent_process->time_for_children
> 
> and either CLONE_VM or CLONE_THREAD is set you see EINVAL.
> 
> You can thus replicate the same error via:
> 
> unshare(CLONE_NEWTIME)
> 
> and a
> 
> clone() or clone3() call with CLONE_VM or CLONE_THREAD.
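
The reproduction steps above can be sketched as follows.  This is an
illustration under assumptions, not a definitive test: it uses the glibc
clone() wrapper with a separate child stack and CLONE_VM only, the helper
name vm_clone_after_unshare() and its return convention are hypothetical,
and the fallback definition of CLONE_NEWTIME is there only for older libc
headers that lack it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080	/* value from <linux/sched.h> */
#endif

/* Separate stack for the child, since it shares the parent's memory. */
static _Alignas(16) char child_stack[64 * 1024];

/* The child does nothing; returning makes the clone() wrapper exit it. */
static int
child_fn(void *arg)
{
	(void) arg;
	return 0;
}

/* Hypothetical helper: unshare(ns_flag), then try a CLONE_VM clone().
 * Returns 0 if clone() succeeded, the positive errno value if clone()
 * failed, and the negated errno value if unshare() itself failed. */
int
vm_clone_after_unshare(int ns_flag)
{
	pid_t pid;

	if (unshare(ns_flag) == -1)
		return -errno;

	/* The stack grows down, so pass the top of the buffer. */
	pid = clone(child_fn, child_stack + sizeof(child_stack),
	            CLONE_VM | SIGCHLD, NULL);
	if (pid == -1)
		return errno;

	/* Reap the child, retrying if interrupted by a signal. */
	while (waitpid(pid, NULL, 0) == -1 && errno == EINTR)
		continue;
	return 0;
}
```

Run with sufficient privileges on a kernel with time namespace support,
vm_clone_after_unshare(CLONE_NEWTIME) would be expected to return EINVAL,
per the explanation above.  Without privileges, the unshare(2) step fails
first and the function returns a negative value such as -EPERM.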

So, to test my doubts, I wrote this program (and similar programs where
only the CLONE_NEW* flag was changed: one with CLONE_NEWTIME and one with
CLONE_NEWNS):

$ cat vfork_newpid.c
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <linux/sched.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

static char *const child_argv[] = {
	"print_pid",
	NULL
};

static char *const child_envp[] = {
	NULL
};

int
main(void)
{
	pid_t pid;

	printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid());

	if (unshare(CLONE_NEWPID) == -1)
		err(EXIT_FAILURE, "unshare(2)");
	if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
		err(EXIT_FAILURE, "signal(2)");

	pid = syscall(SYS_vfork);
	//pid = vfork();  // This behaves differently.
	switch (pid) {
	case 0:
		execve("/home/alx/tmp/print_pid", child_argv, child_envp);
		err(EXIT_SUCCESS, "PID %ld exiting after execve(2)",
		    (long) getpid());
	case -1:
		err(EXIT_FAILURE, "vfork(2)");
	default:
		errx(EXIT_SUCCESS, "Parent exiting after vfork(2).");
	}
}

$ cat print_pid.c
#include <err.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	errx(EXIT_SUCCESS, "PID %ld exiting.", (long) getpid());
}

$ cc -Wall -Wextra -Werror -o print_pid print_pid.c
$ cc -Wall -Wextra -Werror -o vfork_newpid vfork_newpid.c
$
$
$ sudo ./vfork_newpid
vfork_newpid: PID: 8479
vfork_newpid: PID 8479 exiting after execve(2): Success
print_pid: PID 1 exiting.
$
$
$ sudo ./vfork_newtime
vfork_newtime: PID: 8484
vfork_newtime: vfork(2): Invalid argument
$
$
$ sudo ./vfork_newns
vfork_newns: PID: 8486
vfork_newns: PID 8486 exiting after execve(2): Success
print_pid: PID 8487 exiting.


The first thing I noticed is that the usage of vfork(2) differs
considerably from fork(2), which is not clear from reading the manual
page.  It says that the parent process is suspended until the child
calls execve(2).  I took that to mean that vfork(2) doesn't return to
the parent until that happens, but is otherwise transparent.  I was
wrong, and my tests showed me that.

I was going to propose an example program for the manual page, when I
decided to try a slightly different thing: calling vfork() instead of
syscall(SYS_vfork).  That changed the behavior to match fork(2) (i.e.,
the parent resumes after vfork(2) returns the PID of the child).

Is that also intended?  I couldn't find the glibc wrapper source code,
so I don't know what glibc is doing here, but I straced the processes,
and they're all calling vfork(), so the behavior should be consistent;
it's quite weird.  I'm very confused at this point.
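
For comparison, a minimal sketch of the usage pattern that the vfork(2)
manual page describes: the child calls only execve(2) or _exit(2) and
never returns from the function that called vfork(), and the parent
continues once the child has done one of those.  The helper name
run_via_vfork() and the exit status 127 for a failed execve(2) are
illustrative choices, not taken from the programs above.  This is not
offered as a diagnosis of the difference observed; it may be relevant,
though, that with syscall(SYS_vfork) the child returns from syscall()
itself while sharing the parent's stack.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` with `argv` in a vfork(2) child.
 * Returns the child's exit status, 127 if execve(2) failed in the
 * child, or -1 on any other error. */
int
run_via_vfork(const char *path, char *const argv[])
{
	static char *const empty_envp[] = { NULL };
	int status;
	pid_t pid;

	pid = vfork();
	if (pid == -1)
		return -1;

	if (pid == 0) {
		/* Child: only execve(2) or _exit(2) from here on;
		 * no return, no stdio, no exit(3). */
		execve(path, argv, empty_envp);
		_exit(127);
	}

	/* Parent: resumes once the child has exec'd or exited. */
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that this sketch leaves SIGCHLD at its default disposition; with
SIGCHLD ignored, as in the test program above, waitpid(2) would be
expected to fail with ECHILD and the helper would return -1.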


I'm also wondering why it's okay to have processes in different PID ns
share the same vm, but I guess that's an implementation detail that I
don't need to care much about.


Thanks for the details!

Cheers,

Alex