linux-kernel - Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE


Message-ID: <871s1j777c.fsf@oldenburg2.str.redhat.com>
Date:   Tue, 30 Apr 2019 19:07:19 +0200
From:   Florian Weimer <fweimer@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Jann Horn <jannh@...gle.com>, Kevin Easton <kevin@...rana.org>,
        Andy Lutomirski <luto@...nel.org>,
        Christian Brauner <christian@...uner.io>,
        Aleksa Sarai <cyphar@...har.com>,
        "Enrico Weigelt\, metux IT consult" <lkml@...ux.net>,
        Al Viro <viro@...iv.linux.org.uk>,
        David Howells <dhowells@...hat.com>,
        Linux API <linux-api@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        "Serge E. Hallyn" <serge@...lyn.com>,
        Arnd Bergmann <arnd@...db.de>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Kees Cook <keescook@...omium.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Michael Kerrisk <mtk.manpages@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        Daniel Colascione <dancol@...gle.com>
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

* Linus Torvalds:

> On Tue, Apr 30, 2019 at 9:19 AM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
>>
>> Of course, if you *don't* need the exact vfork() semantics, clone
>> itself actually very much supports a callback model with a separate
>> stack. You can basically do this:
>>
>>  - allocate new stack for the child
>>  - in trivial asm wrapper, do:
>>     - push the callback address on the child stack
>>     - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
>>     - "ret"
>>  - free new stack
>>
>> where the "ret" in the child will just go to the callback, while the
>> parent (eventually) just returns from the trivial wrapper and frees
>> the new stack (which by definition is no longer used, since the child
>> has exited or execve'd).
>
> In fact, Florian, maybe this is the solution to your "I want to use
> vfork for posix_spawn(), but I don't know if I can trust it" problem.
>
> Just use clone() directly. On WSL it will presumably just fail, and
> you can then fall back on doing the slow stupid
> fork+pipes-to-communicate.

We already use clone.  I don't know why.  We should add a comment that
provides the reason.

> On valgrind, I don't know what will happen. Maybe it will just do an
> unchecked posix_spawn() because valgrind doesn't catch it?

I think what happens with these emulators is that they use fork (no shared
address space) but suspend the parent thread.  clone with CLONE_VFORK
definitely does not fail.  That mostly works, except that you do not get
back the error code from the execve.  Instead, the process is considered
launched, and the caller collects the exit status from the _exit after
the failed execve.

> Of course, if you *don't* need the exact vfork() semantics, clone
> itself actually very much supports a callback model with a separate
> stack. You can basically do this:
> 
>  - allocate new stack for the child
>  - in trivial asm wrapper, do:
>     - push the callback address on the child stack
>     - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
>     - "ret"
>  - free new stack
> 
> where the "ret" in the child will just go to the callback, while the
> parent (eventually) just returns from the trivial wrapper and frees
> the new stack (which by definition is no longer used, since the child
> has exited or execve'd).
> 
> So you can most definitely create a "vfork_with_child_callback()" with
> clone, and it would arguably be a much superior interface to vfork()
> anyway (maybe you'd like to pass in some arguments to the callback too
> - add more stack setup for the child as needed), but it wouldn't be
> the right solution for programs that just want to use the standard BSD
> vfork() model.
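
For illustration, the callback scheme quoted above can be sketched in C
with the glibc clone() wrapper, which already places the callback on the
child stack, so no hand-written asm is needed.  This is a minimal sketch
under assumptions, not what any libc actually ships: the names
vfork_with_child_callback(), demo_child() and CHILD_STACK_SIZE are
hypothetical, the stack is assumed to grow downwards, and the quoted
"CLONE_SIGCHLD" is read here as the SIGCHLD termination signal OR-ed
into the flags.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical size; the callback must not need more stack than this. */
#define CHILD_STACK_SIZE (64 * 1024)

/*
 * Hypothetical helper: run fn(arg) in a child that shares the parent's
 * address space (CLONE_VM) while the calling thread is suspended
 * (CLONE_VFORK) until the child exits or execs.
 * Returns the child's PID, or -1 on failure.
 */
static pid_t vfork_with_child_callback(int (*fn)(void *), void *arg)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Assumes a downward-growing stack: pass the top of the allocation. */
    pid_t pid = clone(fn, stack + CHILD_STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, arg);

    /*
     * Because of CLONE_VFORK, clone() only returns in the parent once the
     * child has exited or exec'd, so the child stack is no longer in use.
     */
    free(stack);
    return pid;
}

/*
 * Hypothetical demo callback: the store is visible to the parent because
 * the address space is shared; the return value becomes the exit status.
 */
static int demo_child(void *arg)
{
    *(int *)arg = 42;
    return 7;
}
```

A caller would invoke vfork_with_child_callback(demo_child, &flag) and
then reap the child with waitpid() as usual; in this sketch flag is
already 42 when the helper returns in the parent.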

As far as we understand the situation, we believe that we absolutely
must block all signals for both the parent thread and the new
subprocess.  Signals can be unblocked in the subprocess, but only after
setting their handlers to SIG_DFL or SIG_IGN.  (Parent signal handlers
cannot run in the subprocess because application-supplied signal
handlers generally do not expect to run with a corrupt TCB—or a
different PID.)
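
The signal discipline described above might look roughly like the
following C sketch.  It is an illustration under assumptions, not the
actual glibc posix_spawn code: spawn_blocking_signals(), child_fn(),
struct spawn_args and SPAWN_STACK_SIZE are hypothetical names, and
sigprocmask() is used on the assumption that it only affects the calling
thread (as it does on Linux).  The sketch also passes the execve error
back through the shared address space, which is the part that is lost
when an emulator substitutes fork.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SPAWN_STACK_SIZE (64 * 1024)   /* hypothetical size */

struct spawn_args {
    sigset_t orig_mask;     /* caller's signal mask before blocking */
    const char *path;
    char *const *argv;
    int err;                /* errno from a failed execv, else 0 */
};

/* Runs in the child with all signals still blocked. */
static int child_fn(void *p)
{
    struct spawn_args *a = p;
    struct sigaction cur, dfl;

    dfl.sa_handler = SIG_DFL;
    dfl.sa_flags = 0;
    sigemptyset(&dfl.sa_mask);

    /* Reset every caught signal to SIG_DFL; SIG_IGN is left alone.
       The handler table is a private copy (no CLONE_SIGHAND). */
    for (int sig = 1; sig < NSIG; sig++) {
        if (sigaction(sig, NULL, &cur) != 0)
            continue;       /* signal not available to applications */
        if (cur.sa_handler != SIG_DFL && cur.sa_handler != SIG_IGN)
            sigaction(sig, &dfl, NULL);
    }

    /* Only now is it safe to unblock: no parent handler can run here. */
    sigprocmask(SIG_SETMASK, &a->orig_mask, NULL);

    execv(a->path, a->argv);
    a->err = errno;         /* visible to the parent via CLONE_VM */
    return 127;
}

/* Hypothetical helper; returns the child's PID or -1, and stores the
   execv errno (0 on success) in *exec_errno. */
static pid_t spawn_blocking_signals(const char *path, char *const argv[],
                                    int *exec_errno)
{
    struct spawn_args a = { .path = path, .argv = argv, .err = 0 };
    sigset_t all;
    char *stack = malloc(SPAWN_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Block everything in the parent thread; the child inherits the mask. */
    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &a.orig_mask);

    pid_t pid = clone(child_fn, stack + SPAWN_STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, &a);

    /* The child has exec'd or exited: restore the mask, drop the stack. */
    sigprocmask(SIG_SETMASK, &a.orig_mask, NULL);
    free(stack);

    *exec_errno = a.err;
    return pid;
}
```

This version still allocates a separate child stack; whether that can be
skipped for CLONE_VFORK is exactly the open question that follows.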

At that point, I wonder if we can just skip the stack setup for the
CLONE_VFORK case and reuse the existing stack.

Thanks,
Florian