[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190326195151.GD114492@google.com>
Date: Tue, 26 Mar 2019 15:51:51 -0400
From: Joel Fernandes <joel@...lfernandes.org>
To: Christian Brauner <christian@...uner.io>
Cc: jannh@...gle.com, khlebnikov@...dex-team.ru, luto@...nel.org,
dhowells@...hat.com, serge@...lyn.com, ebiederm@...ssion.com,
linux-api@...r.kernel.org, linux-kernel@...r.kernel.org,
arnd@...db.de, keescook@...omium.org, adobriyan@...il.com,
tglx@...utronix.de, mtk.manpages@...il.com, bl0pbl33p@...il.com,
ldv@...linux.org, akpm@...ux-foundation.org, oleg@...hat.com,
nagarathnam.muthusamy@...cle.com, cyphar@...har.com,
viro@...iv.linux.org.uk, dancol@...gle.com
Subject: Re: [PATCH v1 2/4] pid: add pidctl()
On Tue, Mar 26, 2019 at 08:19:43PM +0100, Christian Brauner wrote:
> On Tue, Mar 26, 2019 at 02:10:12PM -0400, Joel Fernandes wrote:
> > On Tue, Mar 26, 2019 at 06:22:33PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 01:06:01PM -0400, Joel Fernandes wrote:
> > > > On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > syscall:
> > > > >
> > > > > "Each process have different pids, one for each pid namespace it belongs.
> > > > > When interaction happens within single pid-ns translation isn't required.
> > > > > More complicated scenarios needs special handling.
> > > > >
> > > > > For example:
> > > > > - reading pid-files or logs written inside container with pid namespace
> > > > > - writing logs with internal pids outside container for pushing them into
> > > > > - attaching with ptrace to tasks from different pid namespace
> > > > >
> > > > > Generally speaking, any cross pid-ns API with pids needs translation.
> > > > >
> > > > > Currently there are several interfaces that could be used here:
> > > > >
> > > > > Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> > > > >
> > > > > Pids for nested pid namespaces are shown in file /proc/[pid]/status.
> > > > > In some cases pid translation could be easily done using this information.
> > > > > Backward translation requires scanning all tasks and becomes really
> > > > > complicated for deeper namespace nesting.
> > > > >
> > > > > Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> > > > > This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> > > > > into pid namespace, this expose process and could be insecure."
> > > > >
> > > > > The original patchset allowed two distinct operations implicitly:
> > > > > - discovering whether pid namespaces (pidns) have a parent-child
> > > > > relationship
> > > > > - translating a pid from a source pidns into a target pidns
> > > > >
> > > > > Both tasks are accomplished in the original patchset by passing a pid
> > > > > along. If the pid argument is passed as 1 the relationship between two pid
> > > > > namespaces can be discovered.
> > > > > The syscall will gain a lot clearer syntax and will be easier to use for
> > > > > userspace if the task it is asked to perform is passed through a
> > > > > command argument. Additionally, it allows us to remove an intrinsic race
> > > > > caused by using the pid argument as a way to discover the relationship
> > > > > between pid namespaces.
> > > > > This patch introduces three commands:
> > > > >
> > > > > /* PIDCMD_QUERY_PID */
> > > > > PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> > > > > Given a source pid namespace fd return the pid of the process in the target
> > > > > namespace:
> > > >
> > > > Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
> > > > IMO (see below).
> > >
> > > Yes, doesn't matter to me too much what we call it.
> >
> > Ok.
> >
> > > > > 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
> > > > > - retrieve pidns identified by source_fd
> > > > > - retrieve struct pid identifed by pid in pidns identified by source_fd
> > > > > - retrieve callers pidns
> > > > > - return pid in callers pidns
> > > > >
> > > > > 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
> > > > > - retrieve callers pidns
> > > > > - retrieve struct pid identifed by pid in callers pidns
> > > > > - retrieve pidns identified by target_fd
> > > > > - return pid in pidns identified by target_fd
> > > > >
> > > > > 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
> > > > > - retrieve pidns identified by source_fd
> > > > > - retrieve struct pid identifed by init task in pidns identified by source_fd
> > > > > - retrieve callers pidns
> > > > > - return pid of init task of pidns identified by source_fd in callers pidns
> > > > >
> > > > > 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
> > > > > - retrieve pidns identified by source_fd
> > > > > - retrieve struct pid identifed by pid in pidns identified by source_fd
> > > > > - retrieve pidns identified by target_fd
> > > > > - check whether struct pid can be found in pidns identified by target_fd
> > > > > - return pid in pidns identified by target_fd
> > > > >
> > > > > /* PIDCMD_QUERY_PIDNS */
> > > > > PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> > > > > namespaces.
> > > > > In the original version of the pachset passing pid as 1 would allow to
> > > > > deterimine the relationship between the pid namespaces. This is inherhently
> > > > > racy. If pid 1 inside a pid namespace has died it would report false
> > > > > negatives. For example, if pid 1 inside of the target pid namespace already
> > > > > died, it would report that the target pid namespace cannot be reached from
> > > > > the source pid namespace because it couldn't find the pid inside of the
> > > > > target pid namespace and thus falsely report to the user that the two pid
> > > > > namespaces are not related. This problem is simple to avoid. In the new
> > > > > version we simply walk the list of ancestors and check whether the
> > > > > namespace are related to each other. By doing it this way we can reliably
> > > > > report what the relationship between two pid namespace file descriptors
> > > > > looks like.
> > > > >
> > > > > 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
> > > > > - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> > > > >
> > > > > 2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
> > > > > - pidns_of(ns_fd1) == pidns_of(ns_fd2)
> > > > >
> > > > > 3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
> > > > > - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)
> > > > >
> > > > > 4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
> > > > > - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)
> > > >
> > > > Why not call this PIDCMD_COMPARE_PIDNS, since a comparison is what you're doing.
> > >
> > > Same.
> >
> > Ok, glad we agree on it.
> >
> > > >
> > > > Again QUERY is ambigious here. Above you called QUERY to translate something,
> > > > now you're calling QUERY to mean compare something. I suggest to be explicit
> > > > about the operation PIDCMD_<OPERATION>_<OPERAND-TYPE>.
> > > >
> > > > Arguably, 2 syscalls for this is cleaner:
> > > > pid_compare_namespaces(ns_fd1, ns_fd2);
> > > > pid_translate(pid, ns_fd1, nds_fd2);
> > >
> > > I don't think we want to send out pid_compare_namespaces() as a separate
> > > syscall. If that's the consensus I'd rather just drop this functionality
> > > completely.
> >
> > I mean, if pid-namespace fd comparison functionality is not needed in
> > userspace, why add it all. I suggest to drop it if not needed and can be
> > considered for addition later when needed.
> >
> > Then you're just left with GET_PID and TRANSLATE_PID which we know we do need.
> >
> > > > > These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
> > > > > improve the functionality expressed implicitly in translate_pid() before.
> > > > >
> > > > > /* PIDCMD_GET_PIDFD */
> > > >
> > > > And this can be a third syscall:
> > > > pidfd_translate(pid, ns_fd1, ns_fd2).
> > >
> > > Sigh, yes. But I honestly want other people other than us to come out
> > > and say "yes, please send a PR to Linus with three separate syscalls for
> > > very closely related functionality". There's a difference between "this
> > > is how we would ideally like to do it" and "this is what is common
> > > practice and will likely get accepted". But I'm really not opposed to it
> > > per se.
> >
> > Ok.
> >
> > > > I am actually supportive of Daniel's view that by combining too many
> > > > arguments into a single syscall, becomes confusing and sometimes some
> > > > arguments have to be forced to 0 in the single shoe-horned syscall. Like you
> > >
> > > There's a difference between an ioctl() and say seccomp() which this is
> > > close to:
> > > int seccomp(unsigned int operation, unsigned int flags, void *args);
> > > The point is that the functionality is closely related not just randomly
> > > unrelated stuff. But as I said I'm more than willing to compromise.
> >
> > Sounds great, yeah whatever makes sense.
>
> One option is to do the following which removes the cmd argument all
> together in favor of a flag GET_PIDFD:
>
> +SYSCALL_DEFINE4(pidfd_translate, pid_t, pid, int, source, int, target,
> + unsigned int, flags)
> +{
Yes that's also fine. But would it not be better to just add the flags as an
extra parameter to translate_pid syscall, telling it what to translate to
(pidfd or pid)?
thanks,
- Joel
> + struct pid_namespace *source_ns, *target_ns;
> + struct pid *struct_pid;
> + pid_t result;
> +
> + if (flags & ~GET_PIDFD)
> + return -EINVAL;
> +
> + source_ns = get_pid_ns_by_fd(source);
> + if (IS_ERR(source_ns))
> + return PTR_ERR(source_ns);
> +
> + target_ns = get_pid_ns_by_fd(target);
> + if (IS_ERR(target_ns)) {
> + put_pid_ns(source_ns);
> + return PTR_ERR(target_ns);
> + }
> +
> + rcu_read_lock();
> + struct_pid = get_pid(find_pid_ns(pid, source_ns));
> + rcu_read_unlock();
> +
> + if (struct_pid)
> + result = pid_nr_ns(struct_pid, target_ns);
> + else
> + result = -ESRCH;
> +
> + if ((flags & GET_PIDFD) && (result > 0))
> + result = pidfd_create_fd(struct_pid, O_CLOEXEC);
> +
> + if (!result)
> + result = -ENOENT;
> +
> + put_pid(struct_pid);
> + put_pid_ns(target_ns);
> + put_pid_ns(source_ns);
> +
> + return result;
> +}
Powered by blists - more mailing lists