[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YxIr15QTmYb4Uojq@gmail.com>
Date: Fri, 2 Sep 2022 09:14:15 -0700
From: Andrei Vagin <avagin@...il.com>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
Cc: Florian Weimer <fweimer@...hat.com>,
Christian Brauner <brauner@...nel.org>,
Dmitry Safonov <0x7f454c46@...il.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Kees Cook <keescook@...omium.org>,
Alexey Izbyshev <izbyshev@...ras.ru>
Subject: Re: Potentially undesirable interactions between vfork() and time
namespaces
On Thu, Sep 01, 2022 at 01:11:37PM -0500, Eric W. Biederman wrote:
> Andrei Vagin <avagin@...il.com> writes:
>
> > On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <avagin@...il.com> wrote:
> >>On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
> > <snip>
> >>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
> >>> tsk->mm->vmacache_seqnum = 0;
> >>> vmacache_flush(tsk);
> >>> task_unlock(tsk);
> >>> +
> >>> + if (vfork)
> >>> + timens_on_fork(tsk->nsproxy, tsk);
> >>> +
> >>>
> >>> Similarly, even after a normal vfork(), time namespace switch could be
> >>> silently skipped if the parent dies before "tsk->vfork_done" is read. Again,
> >>> I don't know whether anybody cares, but this behavior seems non-obvious and
> >>> probably unintended to me.
> >> This is the more interesting case. I will try to find out how we can
> >> handle it properly.
> >
> > It might not be a good idea to use vfork_done in this case. Let's
> > think about what we have and what we want to change. We don't want to
> > allow switching timens if a process mm is used by someone else. But we
> > forgot to handle execve that creates a new mm, and we can't change this
> > behavior right now because it can affect current users. Right?
>
> What we can't changes are things that will break existing programs. If
> existing programs don't care we can change the behavior of the kernel.
I agree that it is very unlikely that anyone will notice
these changes. And it is hard to imagine that anyone uses the old
behavior intentionally.
>
> > So maybe the best choice, in this case, is to change behavior by adding
> > a new control that enables it. The first interface that comes to my mind
> > is to introduce a new ioctl for a namespace file descriptor. Here is a
> > draft patch below that should help to understand what I mean.
>
> I don't think adding a new control works, because programs that are
> calling vfork or posix_spawn today will stop working.
>
> We should recognize that basing things off of CLONE_VFORK was a bad idea
> as CLONE_VFORK is all about waiting for the created task to exec or
> exit, and really has nothing to do with creating a new mm.
>
> Instead I think the rule should be that a new time namespaces is
> installed as soon as we have a new mm.
>
> That will be a behavioral change if the time ns is unshared and then the
> program exec's instead of forking children, but I suspect it is the
> proper behavior all the same, and that existing userspace won't care.
> Especially since all of the vfork_done work is new behavior as
> of v6.0-rc1.
>
> Ugh. I just spotted another bug. The function timens_on_fork as
> written is not safe to call without first creating a fresh copy
> of the nsproxy, and we don't do that during exec. Because nsproxy
> is shared between tasks and processes updating the values needs to
> create a new nsproxy or other tasks/processes can be affected.
> Not hard to handle just something that needs to be addressed.
You are right. Thanks.
>
> Say something like this:
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 9a5ca7b82bfc..8a6947e631dd 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm)
> {
> struct task_struct *tsk;
> struct mm_struct *old_mm, *active_mm;
> - bool vfork;
> int ret;
>
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> - vfork = !!tsk->vfork_done;
> old_mm = current->mm;
> exec_mm_release(tsk, old_mm);
> if (old_mm)
> @@ -1030,9 +1028,6 @@ static int exec_mmap(struct mm_struct *mm)
> vmacache_flush(tsk);
> task_unlock(tsk);
>
> - if (vfork)
> - timens_on_fork(tsk->nsproxy, tsk);
> -
> if (old_mm) {
> mmap_read_unlock(old_mm);
> BUG_ON(active_mm != old_mm);
> @@ -1303,6 +1298,10 @@ int begin_new_exec(struct linux_binprm * bprm)
>
> bprm->mm = NULL;
>
> + retval = exec_task_namespaces();
> + if (retval)
> + goto out_unlock;
> +
> #ifdef CONFIG_POSIX_TIMERS
> spin_lock_irq(&me->sighand->siglock);
> posix_cpu_timers_exit(me);
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index cdb171efc7cb..fee881cded01 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset *set)
> int copy_namespaces(unsigned long flags, struct task_struct *tsk);
> void exit_task_namespaces(struct task_struct *tsk);
> void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
> +int exec_task_namespaces(void);
> void free_nsproxy(struct nsproxy *ns);
> int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
> struct cred *, struct fs_struct *);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 90c85b17bf69..b4a799d9c50f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2043,18 +2043,6 @@ static __latent_entropy struct task_struct *copy_process(
> return ERR_PTR(-EINVAL);
> }
>
> - /*
> - * If the new process will be in a different time namespace
> - * do not allow it to share VM or a thread group with the forking task.
> - *
> - * On vfork, the child process enters the target time namespace only
> - * after exec.
> - */
> - if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
> - if (nsp->time_ns != nsp->time_ns_for_children)
> - return ERR_PTR(-EINVAL);
> - }
pls don't remove this part. It was one of the concerns that vfork
doesn't work after unshare(CLONE_NEWTIME), but it is one of the standard
ways of creating a new process. For example, posix_spawn uses it.
> -
> if (clone_flags & CLONE_PIDFD) {
> /*
> * - CLONE_DETACHED is blocked so that we can potentially
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index b4cbb406bc28..b6647846fe42 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -255,6 +255,24 @@ void exit_task_namespaces(struct task_struct *p)
> switch_task_namespaces(p, NULL);
> }
>
> +int exec_task_namespaces(void)
> +{
> + struct task_struct *tsk = current;
> + struct nsproxy *new;
> +
> + if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns)
> + return 0;
> +
> + new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
> + if (IS_ERR(new))
> + return PTR_ERR(new);
> +
> + timens_on_fork(new, tsk);
> + switch_task_namespaces(tsk, new);
> + return 0;
> +}
> +
> +
> static int check_setns_flags(unsigned long flags)
> {
> if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>
>
>
> To keep things from being too confusing it probably makes sense to
> rename the nsproxy variable from time_ns_for_children to
> time_ns_for_new_mm. Likewise timens_on_fork can be renamed
> timens_on_new_mm.
>
> But that would be follow up work.
>
> How does the above change sound to folks?
It looks good to me.
Thanks,
Andrei
Powered by blists - more mailing lists