Message-ID: <f77ae6fe-a02c-1bd3-39d9-6cb829c3ccdd@gmail.com>
Date: Mon, 27 Apr 2020 22:06:56 +0200
From: "Michael Kerrisk (man-pages)" <mtk.manpages@...il.com>
To: Christian Brauner <christian.brauner@...ntu.com>,
linux-kernel@...r.kernel.org
Cc: mtk.manpages@...il.com, Alexander Viro <viro@...iv.linux.org.uk>,
Stéphane Graber <stgraber@...ntu.com>,
Linux Containers <containers@...ts.linux-foundation.org>,
"Eric W . Biederman" <ebiederm@...ssion.com>,
Serge Hallyn <serge@...lyn.com>,
Aleksa Sarai <cyphar@...har.com>
Subject: Re: [PATCH] nsproxy: attach to namespaces via pidfds
Hello Christian,
On 4/27/20 4:36 PM, Christian Brauner wrote:
> For quite a while we have been thinking about using pidfds to attach to
> namespaces.
(Sounds promising.)
> This patchset has existed for about a year already but we've
> wanted to wait to see how the general api would be received and adopted.
> Now that more and more programs in userspace have started using pidfds
> for process management it's time to send this one out.
>
> This patch makes it possible to use pidfds to attach to the namespaces
> of another process, i.e. they can be passed as the first argument to the
> setns() syscall. When only a single namespace type is specified the
> semantics are equivalent to passing an nsfd. That means
> setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
> when a pidfd is passed, multiple namespace flags can be specified in the
> second setns() argument and setns() will attach the caller to all the
> specified namespaces all at once or to none of them.
While I think I understand the intended semantics, the description
in the previous paragraph feels off. If this text lands in a
commit message (or a manual page), I think it needs fixing.
First, it seems odd to say that
"setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET)"
setns(nsfd, CLONE_NEWNET) means: fail if nsfd does not refer to a
network namespace.
setns(pidfd, CLONE_NEWNET) means: move into just the network
namespace of the process referred to by 'pidfd'.
I would not call those two things "equal", in a semantic sense.
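To make the comparison concrete, here is a minimal sketch of the two calls
side by side. The helper names join_netns_via_nsfd() and
join_netns_via_pidfd() are hypothetical, the fallback syscall number for
pidfd_open() is an assumption that holds on architectures using the unified
syscall table, and the pidfd variant can only succeed on a kernel that
carries the proposed setns() change.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: unified syscall numbering */
#endif

/* nsfd case: CLONE_NEWNET is a check. The call fails unless the
 * descriptor really refers to a network namespace. */
int join_netns_via_nsfd(pid_t pid)
{
    char path[64];
    int fd, ret;

    snprintf(path, sizeof(path), "/proc/%ld/ns/net", (long)pid);
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    ret = setns(fd, CLONE_NEWNET);
    close(fd);
    return ret;
}

/* pidfd case: CLONE_NEWNET is a selector. It picks the network
 * namespace out of everything the target process is a member of. */
int join_netns_via_pidfd(pid_t pid)
{
    int pidfd, ret;

    pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;
    ret = setns(pidfd, CLONE_NEWNET);
    close(pidfd);
    return ret;
}
```

Both helpers may leave the caller in the same namespace, but the flag plays
a different role in each, which is the distinction I am getting at.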
And then:
> If 0 is specified
> together with a pidfd then setns() will interpret it the same way 0 is
> interpreted together with a nsfd argument, i.e. attach to any/all
> namespaces.
If I understand right, setns(pidfd, 0) would mean: move into
all of the same namespaces as the process referred to by 'pidfd'.
But setns(nsfd, 0) means: move into whatever kind of namespace
is referred to by 'nsfd'.
I would not say of these two cases that 0 is interpreted
in the same way.
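A toy model may help show why the two readings of 0 differ. This is not
kernel code: model_setns_nsfd() and model_setns_pidfd() are hypothetical
names, they only compute which namespace types the caller would end up
joining, and the pidfd behavior for 0 is taken from the quoted description
rather than from anything I have tested.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/* Every namespace type a fully isolated target might be a member of. */
#define ALL_NS (CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET | \
                CLONE_NEWIPC | CLONE_NEWUTS | CLONE_NEWCGROUP)

/* nsfd: the descriptor already names exactly one namespace, so nstype
 * can only confirm its type; 0 means "skip the type check". */
int model_setns_nsfd(int fd_type, int nstype)
{
    if (nstype != 0 && nstype != fd_type)
        return -EINVAL;
    return fd_type;
}

/* pidfd: the descriptor names a process, so nstype chooses which of
 * its namespaces to join; 0 is described as "all of them". */
int model_setns_pidfd(int nstype)
{
    if (nstype == 0)
        return ALL_NS;
    return nstype;
}
```

In the first function 0 widens what is accepted while the result stays a
single namespace; in the second, 0 widens the result itself.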
Hopefully I have not misunderstood.
> The obvious example where this is useful is a standard container
> manager interacting with a running container: pushing and pulling files
> or directories, injecting mounts, attaching/execing any kind of process,
> managing network devices all these operations require attaching to all
> or at least multiple namespaces at the same time. Given that nowadays
> most containers are spawned with all namespaces enabled we're currently
> looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
> nsfds, another 7 to actually perform the namespace switch. With time
> namespaces we're looking at about 16 syscalls.
> (We could amortize the first 7 or 8 syscalls for opening the nsfds by
> stashing them in each container's monitor process but that would mean
> we need to send around those file descriptors through unix sockets
> everytime we want to interact with the container or keep on-disk
> state. Even in scenarios where a caller wants to join a particular
> namespace in a particular order callers still profit from batching
> other namespaces. That mostly applies to the user namespace but
> all container runtimes I found join the user namespace first no matter
> if it privileges or deprivileges the container.)
> With pidfds this becomes a single syscall no matter how many namespaces
> are supposed to be attached to.
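For reference, the status quo described above can be sketched roughly as
follows: seven open() calls followed by seven setns() calls. The helper
name join_all_via_nsfds() and the exact ordering after the user namespace
are my assumptions, not something taken from a particular container
runtime.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* User namespace first, as the quoted text says runtimes tend to do. */
static const char *const ns_names[] = {
    "user", "mnt", "pid", "net", "ipc", "uts", "cgroup",
};
#define NS_COUNT (sizeof(ns_names) / sizeof(ns_names[0]))

/* Returns the number of namespaces joined, or -1 on the first failure.
 * A failure part-way through the second loop leaves the caller in a
 * mix of old and new namespaces; nothing here is all-or-nothing. */
int join_all_via_nsfds(pid_t pid)
{
    char path[64];
    int fds[NS_COUNT];
    size_t i, opened;
    int ret = -1;

    /* Open every nsfd before switching anything, because /proc may
     * look different once the mount namespace has changed. */
    for (opened = 0; opened < NS_COUNT; opened++) {
        snprintf(path, sizeof(path), "/proc/%ld/ns/%s",
                 (long)pid, ns_names[opened]);
        fds[opened] = open(path, O_RDONLY | O_CLOEXEC);
        if (fds[opened] < 0)
            goto out;
    }

    for (i = 0; i < NS_COUNT; i++) {
        if (setns(fds[i], 0) < 0)
            goto out;
    }
    ret = (int)NS_COUNT;
out:
    for (i = 0; i < opened; i++)
        close(fds[i]);
    return ret;
}
```

That is 2 * NS_COUNT = 14 syscalls before counting the close() calls, versus
a single setns() on a pidfd under the proposal.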
That does seem like a win. Thanks for working on this!
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/