[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190331150507.zpyugdvtmr6rgpda@brauner.io>
Date: Sun, 31 Mar 2019 17:05:08 +0200
From: Christian Brauner <christian@...uner.io>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Jann Horn <jannh@...gle.com>,
Joel Fernandes <joel@...lfernandes.org>,
Daniel Colascione <dancol@...gle.com>,
Andrew Lutomirski <luto@...nel.org>,
David Howells <dhowells@...hat.com>,
"Serge E. Hallyn" <serge@...lyn.com>,
Linux API <linux-api@...r.kernel.org>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
Arnd Bergmann <arnd@...db.de>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Konstantin Khlebnikov <khlebnikov@...dex-team.ru>,
Kees Cook <keescook@...omium.org>,
Alexey Dobriyan <adobriyan@...il.com>,
Thomas Gleixner <tglx@...utronix.de>,
Michael Kerrisk-manpages <mtk.manpages@...il.com>,
Jonathan Kowalski <bl0pbl33p@...il.com>,
"Dmitry V. Levin" <ldv@...linux.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Oleg Nesterov <oleg@...hat.com>,
Nagarathnam Muthusamy <nagarathnam.muthusamy@...cle.com>,
Aleksa Sarai <cyphar@...har.com>,
Al Viro <viro@...iv.linux.org.uk>
Subject: Re: [PATCH v2 0/5] pid: add pidfd_open()
On Sun, Mar 31, 2019 at 07:52:28AM -0700, Linus Torvalds wrote:
> On Sat, Mar 30, 2019 at 9:47 PM Jann Horn <jannh@...gle.com> wrote:
> >
> > Sure, given a pidfd_clone() syscall, as long as the parent of the
> > process is giving you a pidfd for it and you don't have to deal with
> > grandchildren created by fork() calls outside your control, that
> > works.
>
> Don't do pidfd_clone() and pidfd_wait().
>
> Both of those existing system calls already get a "flags" argument.
> Just make a WPIDFD (for waitid) and CLONE_PIDFD (for clone) bit, and
> make the existing system calls just take/return a pidfd.
Yes, that's one of the options I was considering but was afraid of
pitching it because of the very massive opposition I got
regarding"multiplexers". I'm perfectly happy with doing it this way.
>
> Side note: we could (should?) also make the default maxpid just be
> larger. It needs to fit in an 'int', but MAXINT instead of 65535 would
> likely alreadt make a lot of these attacks harder.
Yes, agreed.
>
> There was some really old legacy reason why we actually limited it to
> 65535 originally. It was old and crufty even back when..
So Jann and I have been thinking about going forward with the following
idea:
With the pidfd_open() patchset I have pidfds are simple anone inode file
descriptors stashing a reference to struct pid of a process. I have
mentioned this is in prior mails. This cleanly decouples pidfds from
procfs completely.
The reason why we want to use pidfds with no connection to a specific procfs
instance, even in environments that have procfs, is that we would like to add
the API to clone with CLONE_PIDFD that you just mentioned that creates a
new process or thread and returns a pidfd to it. In the context of such
a syscall, it would be awkward to have the kernel open a file in some
procfs instance, since then userspace would have to specify which procfs
instance the fd should come from.
There is an argument to be made that for consistency's sake we should - although
I don't feel strongly about it - disallow the usage of pidfd_send_signal() with
fds referring to /proc/<pid> then. Unless you want this to always work. If you
want this to work then we would simply submit pidfd_open() for the 5.2
window. If you agree that it makes sense to only have pidfd_open() file
descriptors working with pidfd_send_signal() we would send a revert for
pidfd_send_signal() now and resubmit it together with pidfd_open()
during the 5.2. merge window. This decouples pidfds completely from
procfs not just when it is not compiled in or mounted.
I very much care about this being done right. If this means temporarily
kicking pidfd_send_signal() out until 5.2 I'm happy to do so.
Btw, the /proc/<pid> race issue that is mentioned constantly is simply
avoidable by placing the pid that the pidfd has stashed relative to the
callers' procfs mount's pid namespace in the pidfd's fdinfo. So there's
not even a need to really go through /proc/<pid> in the first place. A
caller wanting to get metadata access and avoid a race with pid
recycling can then simply do:
int pidfd = pidfd_open(pid, 0);
int pid = parse_fdinfo("/proc/self/fdinfo/<pidfd>");
int procpidfd = open("/proc/<pid>", ...);
/* Test if process still exists by sending signal 0 through our pidfd. */
int ret = pidfd_send_signal(pid, 0, NULL, PIDFD_SIGNAL_THREAD);
if (ret < 0 && errno == ESRCH) {
/* pid has been recycled and procpidfd refers to another process */
}
Christian
Powered by blists - more mailing lists