linux-kernel - Re: Potentially undesirable interactions between vfork() and time namespaces

Message-ID: <fed91ee96b861aaf8db3d72c1b7eb135@ispras.ru>
Date:   Thu, 01 Sep 2022 18:49:39 +0300
From:   Alexey Izbyshev <izbyshev@...ras.ru>
To:     Andrei Vagin <avagin@...il.com>
Cc:     Florian Weimer <fweimer@...hat.com>,
        Christian Brauner <brauner@...nel.org>,
        Dmitry Safonov <0x7f454c46@...il.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Eric Biederman <ebiederm@...ssion.com>,
        Kees Cook <keescook@...omium.org>
Subject: Re: Potentially undesirable interactions between vfork() and time
 namespaces

On 2022-09-01 06:45, Andrei Vagin wrote:
> On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <avagin@...il.com> wrote:
>> On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
> <snip>
>>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>         tsk->mm->vmacache_seqnum = 0;
>>>         vmacache_flush(tsk);
>>>         task_unlock(tsk);
>>> +
>>> +       if (vfork)
>>> +               timens_on_fork(tsk->nsproxy, tsk);
>>> +
>>> 
>>> Similarly, even after a normal vfork(), time namespace switch could be
>>> silently skipped if the parent dies before "tsk->vfork_done" is read. Again,
>>> I don't know whether anybody cares, but this behavior seems non-obvious and
>>> probably unintended to me.
>> This is the more interesting case. I will try to find out how we can
>> handle it properly.
> 
> It might not be a good idea to use vfork_done in this case. Let's
> think about what we have and what we want to change. We don't want to
> allow switching timens if a process mm is used by someone else. But we
> forgot to handle execve that creates a new mm, and we can't change this
> behavior right now because it can affect current users. Right?
> 
> So maybe the best choice, in this case, is to change behavior by adding
> a new control that enables it. The first interface that comes to my mind
> is to introduce a new ioctl for a namespace file descriptor. Here is a
> draft patch below that should help to understand what I mean.
> 
While I'm not a user of time namespaces (at least not yet), I welcome a 
change that makes time namespace switching and inheritance semantics 
easier to understand and document. Here is my understanding of how these 
semantics evolved.

Before the original patch that allowed vfork():

* Switching happens only on clone(~CLONE_VM), i.e. clone() without 
CLONE_VM.
* clone(CLONE_VM) is forbidden after unshare(CLONE_NEWTIME) (so 
vfork() and pthread_create() fail).
* time_ns/time_ns_for_children is preserved across execve().

After that patch:

* Switching happens on clone(~CLONE_VM).
* Switching also happens on execve() if the current task is a 
vfork-child whose creator task is still alive (because of reliance on 
"vfork_done").
* clone(CLONE_VM) is forbidden after unshare(CLONE_NEWTIME) unless it's 
clone(CLONE_VM|CLONE_VFORK), in which case time_ns/time_ns_for_children 
is inherited.
* time_ns/time_ns_for_children is preserved across execve() unless 
switched as described above.

Note that switching conditions on execve() are very subtle. Apart from 
the motivating use case of "unshare(CLONE_NEWTIME) -> vfork() -> 
execve()", it would also happen on e.g. "vfork() -> 
unshare(CLONE_NEWTIME) -> execve()", because unshare(CLONE_NEWTIME) is 
not forbidden for tasks which share mm.

With the current patch:

* Switching happens on clone(~CLONE_VM).
* Switching also happens on execve() if ioctl(TIMENS_SET_SWITCH_ON_EXEC) 
was called on time_ns_for_children.
* clone(CLONE_VM) is forbidden after unshare(CLONE_NEWTIME) unless it's 
clone(CLONE_VM|CLONE_VFORK) without CLONE_THREAD, in which case 
time_ns/time_ns_for_children is inherited. Thereby vfork() is permitted, 
while pthread_create() is not.
* time_ns/time_ns_for_children is preserved across execve() unless 
switched as described above.

So in terms of cognitive complexity it seems like a clear improvement 
that regains some of the simplicity of the initial implementation.
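
The three rule sets above can be condensed into a small user-space model. 
This is only a sketch of my reading of the rules, not kernel code: the 
enum labels, the struct and both function names are made up for 
illustration, and the M_CLONE_* constants merely mirror the values in 
<linux/sched.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values match <linux/sched.h>; the M_ prefix marks them as model-local. */
#define M_CLONE_VM     0x00000100UL
#define M_CLONE_VFORK  0x00004000UL
#define M_CLONE_THREAD 0x00010000UL

/* Hypothetical labels for the three stages discussed in this message. */
enum timens_rules {
	RULES_INITIAL,      /* before vfork() was allowed */
	RULES_VFORK_PATCH,  /* switch on execve() keyed off vfork_done */
	RULES_DRAFT_IOCTL,  /* switch on execve() keyed off switch_on_exec */
};

struct task_model {
	bool timens_pending;  /* time_ns != time_ns_for_children */
	bool vfork_child;     /* created by clone(CLONE_VM|CLONE_VFORK) */
	bool creator_alive;   /* models "vfork_done" still being set */
	bool switch_on_exec;  /* draft flag on time_ns_for_children */
};

/* Would clone(flags) be permitted for this task under the given rules? */
static bool clone_permitted(enum timens_rules rules,
			    const struct task_model *t, unsigned long flags)
{
	assert(t != NULL);
	/* Restrictions apply only to CLONE_VM after unshare(CLONE_NEWTIME). */
	if (!t->timens_pending || !(flags & M_CLONE_VM))
		return true;
	switch (rules) {
	case RULES_INITIAL:
		return false;
	case RULES_VFORK_PATCH:
		return (flags & M_CLONE_VFORK) != 0;
	case RULES_DRAFT_IOCTL:
		return (flags & M_CLONE_VFORK) && !(flags & M_CLONE_THREAD);
	}
	return false;
}

/* Would execve() move this task into time_ns_for_children? */
static bool switches_on_execve(enum timens_rules rules,
			       const struct task_model *t)
{
	assert(t != NULL);
	if (!t->timens_pending)
		return false;  /* nothing to switch to */
	switch (rules) {
	case RULES_INITIAL:
		return false;
	case RULES_VFORK_PATCH:
		return t->vfork_child && t->creator_alive;
	case RULES_DRAFT_IOCTL:
		return t->switch_on_exec;
	}
	return false;
}
```

In this model the parent-death case is a task with timens_pending and 
vfork_child set but creator_alive cleared: switches_on_execve() is false 
for RULES_VFORK_PATCH, while for RULES_DRAFT_IOCTL the answer depends 
only on switch_on_exec.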

However, I'd like to point out that while for a narrow fix of the 
original issue (vfork() doesn't work when fork() does) time ns switching 
on execve() is not required at all, removing "automatic" switching in 
posix_spawn()-like cases could potentially surprise time namespace 
users. In the initial time ns implementation, "unshare(CLONE_NEWTIME); 
posix_spawn(...)" would either succeed with the expected effect (an 
executable is running in a new time ns) or fail, depending on whether 
posix_spawn() uses fork() or vfork(). With the first patch, vfork-based 
posix_spawn() would *usually* behave as a fork-based one (modulo the 
parent death issue). But with the current patch, unless user space is 
modified to set switch_on_exec, vfork-based posix_spawn() will succeed 
but the executable will run in the parent's time ns. I'm not in a 
position to estimate whether any actual time ns users are affected, 
but it still looks like something that could affect *future* time ns 
users who are not careful enough.

Regarding the interface to control switching on execve(), one possible 
alternative to ioctl() is a separate file in /proc like 
/proc/$PID/setgroups, which was added in a somewhat similar situation 
(fixing a problem with the user namespace implementation). Regardless of 
the interface, it'd probably be nice to also have the ability to get the 
current value of the switch_on_exec flag.

Thanks,
Alexey

> ---
>  fs/exec.c                                   |  4 +---
>  fs/nsfs.c                                   |  3 +++
>  include/linux/proc_ns.h                     |  1 +
>  include/linux/time_namespace.h              |  1 +
>  include/uapi/linux/nsfs.h                   |  2 ++
>  kernel/fork.c                               |  3 ++-
>  kernel/time/namespace.c                     | 15 +++++++++++++++
>  tools/testing/selftests/timens/vfork_exec.c | 14 +++++++++++++-
>  8 files changed, 38 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 9a5ca7b82bfc..961348084257 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm)
>  {
>  	struct task_struct *tsk;
>  	struct mm_struct *old_mm, *active_mm;
> -	bool vfork;
>  	int ret;
> 
>  	/* Notify parent that we're no longer interested in the old VM */
>  	tsk = current;
> -	vfork = !!tsk->vfork_done;
>  	old_mm = current->mm;
>  	exec_mm_release(tsk, old_mm);
>  	if (old_mm)
> @@ -1030,7 +1028,7 @@ static int exec_mmap(struct mm_struct *mm)
>  	vmacache_flush(tsk);
>  	task_unlock(tsk);
> 
> -	if (vfork)
> +	if (READ_ONCE(tsk->nsproxy->time_ns_for_children->switch_on_exec))
>  		timens_on_fork(tsk->nsproxy, tsk);
> 
>  	if (old_mm) {
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index 800c1d0eb0d0..723ab5f69bcd 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -11,6 +11,7 @@
>  #include <linux/user_namespace.h>
>  #include <linux/nsfs.h>
>  #include <linux/uaccess.h>
> +#include <linux/nsfs.h>
> 
>  #include "internal.h"
> 
> @@ -210,6 +211,8 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl,
>  		uid = from_kuid_munged(current_user_ns(), user_ns->owner);
>  		return put_user(uid, argp);
>  	default:
> +		if (ns->ops->ioctl)
> +			return ns->ops->ioctl(ns, ioctl, arg);
>  		return -ENOTTY;
>  	}
>  }
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index 75807ecef880..b690eb1a3468 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -22,6 +22,7 @@ struct proc_ns_operations {
>  	int (*install)(struct nsset *nsset, struct ns_common *ns);
>  	struct user_namespace *(*owner)(struct ns_common *ns);
>  	struct ns_common *(*get_parent)(struct ns_common *ns);
> +	long (*ioctl)(struct ns_common *ns, unsigned int ioctl, unsigned long arg);
>  } __randomize_layout;
> 
>  extern const struct proc_ns_operations netns_operations;
> diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
> index 3146f1c056c9..6569300d68ce 100644
> --- a/include/linux/time_namespace.h
> +++ b/include/linux/time_namespace.h
> @@ -24,6 +24,7 @@ struct time_namespace {
>  	struct page		*vvar_page;
>  	/* If set prevents changing offsets after any task joined namespace. */
>  	bool			frozen_offsets;
> +	bool			switch_on_exec;
>  } __randomize_layout;
> 
>  extern struct time_namespace init_time_ns;
> diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h
> index a0c8552b64ee..ce3a9f9b1bcf 100644
> --- a/include/uapi/linux/nsfs.h
> +++ b/include/uapi/linux/nsfs.h
> @@ -16,4 +16,6 @@
>  /* Get owner UID (in the caller's user namespace) for a user namespace */
>  #define NS_GET_OWNER_UID	_IO(NSIO, 0x4)
> 
> +#define TIMENS_SET_SWITCH_ON_EXEC _IO(NSIO, 0x100)
> +
>  #endif /* __LINUX_NSFS_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 90c85b17bf69..1f7bf2a087e9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2050,7 +2050,8 @@ static __latent_entropy struct task_struct *copy_process(
>  	 * On vfork, the child process enters the target time namespace only
>  	 * after exec.
>  	 */
> -	if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
> +	if ((clone_flags & CLONE_THREAD) ||
> +	    (clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
>  		if (nsp->time_ns != nsp->time_ns_for_children)
>  			return ERR_PTR(-EINVAL);
>  	}
> diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
> index aec832801c26..9966e0bdefa7 100644
> --- a/kernel/time/namespace.c
> +++ b/kernel/time/namespace.c
> @@ -17,6 +17,7 @@
>  #include <linux/cred.h>
>  #include <linux/err.h>
>  #include <linux/mm.h>
> +#include <linux/nsfs.h>
> 
>  #include <vdso/datapage.h>
> 
> @@ -439,6 +440,18 @@ int proc_timens_set_offset(struct file *file, struct task_struct *p,
>  	return err;
>  }
> 
> +static long timens_ioctl(struct ns_common *ns, unsigned int ioctl, unsigned long arg)
> +{
> +	struct time_namespace *time_ns = to_time_ns(ns);
> +
> +	switch (ioctl) {
> +	case TIMENS_SET_SWITCH_ON_EXEC:
> +		WRITE_ONCE(time_ns->switch_on_exec, true);
> +		return 0;
> +	}
> +	return -ENOTTY;
> +}
> +
>  const struct proc_ns_operations timens_operations = {
>  	.name		= "time",
>  	.type		= CLONE_NEWTIME,
> @@ -446,6 +459,7 @@ const struct proc_ns_operations timens_operations = {
>  	.put		= timens_put,
>  	.install	= timens_install,
>  	.owner		= timens_owner,
> +	.ioctl		= timens_ioctl,
>  };
> 
>  const struct proc_ns_operations timens_for_children_operations = {
> @@ -456,6 +470,7 @@ const struct proc_ns_operations timens_for_children_operations = {
>  	.put		= timens_put,
>  	.install	= timens_install,
>  	.owner		= timens_owner,
> +	.ioctl		= timens_ioctl,
>  };
> 
>  struct time_namespace init_time_ns = {
> diff --git a/tools/testing/selftests/timens/vfork_exec.c b/tools/testing/selftests/timens/vfork_exec.c
> index e6ccd900f30a..5f4e2043e0a7 100644
> --- a/tools/testing/selftests/timens/vfork_exec.c
> +++ b/tools/testing/selftests/timens/vfork_exec.c
> @@ -12,6 +12,11 @@
>  #include <time.h>
>  #include <unistd.h>
>  #include <string.h>
> +#include <fcntl.h>
> +#include <sys/ioctl.h>
> +#include <linux/nsfs.h>
> +
> +#define TIMENS_SET_SWITCH_ON_EXEC _IO(NSIO, 0x100)
> 
>  #include "log.h"
>  #include "timens.h"
> @@ -21,7 +26,7 @@
>  int main(int argc, char *argv[])
>  {
>  	struct timespec now, tst;
> -	int status, i;
> +	int status, i, nsfd;
>  	pid_t pid;
> 
>  	if (argc > 1) {
> @@ -45,6 +50,13 @@ int main(int argc, char *argv[])
>  	if (unshare_timens())
>  		return 1;
> 
> +	nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
> +	if (nsfd < 0)
> +		return pr_perror("open");
> +	if (ioctl(nsfd, TIMENS_SET_SWITCH_ON_EXEC))
> +		return pr_perror("ioctl");
> +	close(nsfd);
> +
>  	if (_settime(CLOCK_MONOTONIC, OFFSET))
>  		return 1;