linux-kernel - Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87v9ywbkp8.fsf@oldenburg2.str.redhat.com>
Date:   Mon, 29 Apr 2019 22:49:55 +0200
From:   Florian Weimer <fweimer@...hat.com>
To:     Jann Horn <jannh@...gle.com>
Cc:     Kevin Easton <kevin@...rana.org>,
        Andy Lutomirski <luto@...nel.org>,
        Christian Brauner <christian@...uner.io>,
        Aleksa Sarai <cyphar@...har.com>,
        "Enrico Weigelt\, metux IT consult" <lkml@...ux.net>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        David Howells <dhowells@...hat.com>,
        Linux API <linux-api@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        "Serge E. Hallyn" <serge@...lyn.com>,
        Arnd Bergmann <arnd@...db.de>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Kees Cook <keescook@...omium.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Michael Kerrisk <mtk.manpages@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        Daniel Colascione <dancol@...gle.com>
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

* Jann Horn:

>> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
>> <clone flags and arguments, maybe in a struct>)
>>
>> and then you'd use it like this to fork off a child process:
>>
>> int spawn_shell_subprocess_(void *arg) {
>>   char *cmdline = arg;
>>   execl("/bin/sh", "sh", "-c", cmdline);
>>   return -1;
>> }
>> pid_t spawn_shell_subprocess(char *cmdline) {
>>   pid_t child_pid;
>>   int res = clone_temporary(spawn_shell_subprocess_, cmdline,
>> &child_pid, [...]);
>>   if (res == 0) return child_pid;
>>   return res;
>> }
>>
>> clone_temporary() could be implemented roughly as follows by the libc
>> (or other userspace code):
>>
>> sigset_t sigset, sigset_old;
>> sigfillset(&sigset);
>> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
>> int child_pid;
>> int result = 0;
>> /* starting here, use inline assembly to ensure that no stack
>> allocations occur */
>> long child = syscall(__NR_clone,
>> CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
>> ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
>> if (child == -1) { result = -1; goto reset_sigmask; }
>> if (child == 0) {
>>   result = fn(arg);
>>   syscall(__NR_exit, 0);
>> }
>> futex(&child_pid, FUTEX_WAIT, child, NULL);
>> /* end of no-stack-allocations zone */
>> reset_sigmask:
>> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
>> return result;
>
> ... I guess that already has a name, and it's called vfork(). (Well,
> except that the Linux vfork() isn't a real vfork().)
>
> So I guess my question is: Why not vfork()?

Mainly because some users want access to the clone flags, and that's not
possible with the current userspace wrappers.  The stack setup for the
undocumented clone wrapper is also cumbersome, and the ia64 pecularity
annoying.

For the stack sharing, the callback-based interface looks like the
absolutely right thing to do to me.  It enforces the notion that you can
safely return on the child path from a function calling vfork.

> And if vfork() alone isn't flexible enough, alternatively: How about
> an API that forks a new child in the same address space, and then
> allows the parent to invoke arbitrary syscalls in the context of the
> child?

As long it's not an eBPF script …

> You could also build that in userspace if you wanted, I think - just
> let the child run an assembly loop that reads registers from a unix
> seqpacket socket, invokes the syscall instruction, and writes the
> value of the result register back into the seqpacket socket. As long
> as you use CLONE_VM, you don't have to worry about moving the pointer
> targets of syscalls. The user-visible API could look like this:

People already use a variant of this, execve'ing twice.  See
jspawnhelper.

Thanks,
Florian