[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHrFyr5QNTb-y4wO0vph4u7LZtEdr5A+KtvxPokAUrkUQgb-5A@mail.gmail.com>
Date: Mon, 29 Apr 2019 22:52:50 +0200
From: Christian Brauner <christian@...uner.io>
To: Florian Weimer <fweimer@...hat.com>
Cc: Jann Horn <jannh@...gle.com>, Kevin Easton <kevin@...rana.org>,
Andy Lutomirski <luto@...nel.org>,
Aleksa Sarai <cyphar@...har.com>,
"Enrico Weigelt, metux IT consult" <lkml@...ux.net>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Al Viro <viro@...iv.linux.org.uk>,
David Howells <dhowells@...hat.com>,
Linux API <linux-api@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>,
"Serge E. Hallyn" <serge@...lyn.com>,
Arnd Bergmann <arnd@...db.de>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Kees Cook <keescook@...omium.org>,
Thomas Gleixner <tglx@...utronix.de>,
Michael Kerrisk <mtk.manpages@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Oleg Nesterov <oleg@...hat.com>,
Joel Fernandes <joel@...lfernandes.org>,
Daniel Colascione <dancol@...gle.com>
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
On Mon, Apr 29, 2019 at 10:50 PM Florian Weimer <fweimer@...hat.com> wrote:
>
> * Jann Horn:
>
> >> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
> >> <clone flags and arguments, maybe in a struct>)
> >>
> >> and then you'd use it like this to fork off a child process:
> >>
> >> int spawn_shell_subprocess_(void *arg) {
> >> char *cmdline = arg;
> >> execl("/bin/sh", "sh", "-c", cmdline);
> >> return -1;
> >> }
> >> pid_t spawn_shell_subprocess(char *cmdline) {
> >> pid_t child_pid;
> >> int res = clone_temporary(spawn_shell_subprocess_, cmdline,
> >> &child_pid, [...]);
> >> if (res == 0) return child_pid;
> >> return res;
> >> }
> >>
> >> clone_temporary() could be implemented roughly as follows by the libc
> >> (or other userspace code):
> >>
> >> sigset_t sigset, sigset_old;
> >> sigfillset(&sigset);
> >> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
> >> int child_pid;
> >> int result = 0;
> >> /* starting here, use inline assembly to ensure that no stack
> >> allocations occur */
> >> long child = syscall(__NR_clone,
> >> CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
> >> ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
> >> if (child == -1) { result = -1; goto reset_sigmask; }
> >> if (child == 0) {
> >> result = fn(arg);
> >> syscall(__NR_exit, 0);
> >> }
> >> futex(&child_pid, FUTEX_WAIT, child, NULL);
> >> /* end of no-stack-allocations zone */
> >> reset_sigmask:
> >> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
> >> return result;
> >
> > ... I guess that already has a name, and it's called vfork(). (Well,
> > except that the Linux vfork() isn't a real vfork().)
> >
> > So I guess my question is: Why not vfork()?
>
> Mainly because some users want access to the clone flags, and that's not
> possible with the current userspace wrappers. The stack setup for the
> undocumented clone wrapper is also cumbersome, and the ia64 pecularity
> annoying.
>
> For the stack sharing, the callback-based interface looks like the
> absolutely right thing to do to me. It enforces the notion that you can
> safely return on the child path from a function calling vfork.
>
> > And if vfork() alone isn't flexible enough, alternatively: How about
> > an API that forks a new child in the same address space, and then
> > allows the parent to invoke arbitrary syscalls in the context of the
> > child?
>
> As long it's not an eBPF script …
You shouldn't even joke about this (I'm serious.).
I'm very certain there are people who'd think this is a good idea.
>
> > You could also build that in userspace if you wanted, I think - just
> > let the child run an assembly loop that reads registers from a unix
> > seqpacket socket, invokes the syscall instruction, and writes the
> > value of the result register back into the seqpacket socket. As long
> > as you use CLONE_VM, you don't have to worry about moving the pointer
> > targets of syscalls. The user-visible API could look like this:
>
> People already use a variant of this, execve'ing twice. See
> jspawnhelper.
>
> Thanks,
> Florian
Powered by blists - more mailing lists