linux-kernel - Re: [PATCH] nsproxy: attach to namespaces via pidfds


Message-ID: <f77ae6fe-a02c-1bd3-39d9-6cb829c3ccdd@gmail.com>
Date:   Mon, 27 Apr 2020 22:06:56 +0200
From:   "Michael Kerrisk (man-pages)" <mtk.manpages@...il.com>
To:     Christian Brauner <christian.brauner@...ntu.com>,
        linux-kernel@...r.kernel.org
Cc:     mtk.manpages@...il.com, Alexander Viro <viro@...iv.linux.org.uk>,
        Stéphane Graber <stgraber@...ntu.com>,
        Linux Containers <containers@...ts.linux-foundation.org>,
        "Eric W . Biederman" <ebiederm@...ssion.com>,
        Serge Hallyn <serge@...lyn.com>,
        Aleksa Sarai <cyphar@...har.com>
Subject: Re: [PATCH] nsproxy: attach to namespaces via pidfds

Hello Christian,

On 4/27/20 4:36 PM, Christian Brauner wrote:
> For quite a while we have been thinking about using pidfds to attach to
> namespaces. 

(Sounds promising.)

> This patchset has existed for about a year already but we've
> wanted to wait to see how the general api would be received and adopted.
> Now that more and more programs in userspace have started using pidfds
> for process management it's time to send this one out.
> 
> This patch makes it possible to use pidfds to attach to the namespaces
> of another process, i.e. they can be passed as the first argument to the
> setns() syscall. When only a single namespace type is specified the
> semantics are equivalent to passing an nsfd. That means
> setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
> when a pidfd is passed, multiple namespace flags can be specified in the
> second setns() argument and setns() will attach the caller to all the
> specified namespaces all at once or to none of them. 

While I think I understand the intended semantics, the description
in the previous paragraph feels off. If this text lands in a commit
message (or a manual page), I think it needs fixing.

First, it seems odd to say that 

   "setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET)"

setns(nsfd, CLONE_NEWNET) means: fail if nsfd does not refer to a
network namespace.

setns(pidfd, CLONE_NEWNET) means: move into just the network
namespace of the process referred to by 'pidfd'.

I would not call those two things "equal", in a semantic sense.
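
To make the distinction concrete, here is a minimal sketch in C. The
helper names join_netns_via_nsfd() and join_netns_via_pidfd() are made
up for illustration. The pidfd variant assumes a kernel that accepts a
pidfd as the first setns() argument, as this patch proposes, and the
fallback value 434 for SYS_pidfd_open is an assumption that holds only
on architectures using the common syscall numbering.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: common syscall table number */
#endif

/*
 * nsfd case: CLONE_NEWNET acts as a type check. The namespace to join
 * is already fixed by the file descriptor; the call fails with EINVAL
 * if that descriptor does not refer to a network namespace.
 * Returns 0 on success or a negative errno value.
 */
static int join_netns_via_nsfd(const char *ns_path)
{
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    int ret;

    if (fd < 0)
        return -errno;
    ret = setns(fd, CLONE_NEWNET) < 0 ? -errno : 0;
    close(fd);
    return ret;
}

/*
 * pidfd case: CLONE_NEWNET acts as a selector. The descriptor names a
 * process, and the flag picks which of that process's namespaces the
 * caller moves into.
 * Returns 0 on success or a negative errno value.
 */
static int join_netns_via_pidfd(pid_t pid)
{
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    int ret;

    if (pidfd < 0)
        return -errno;
    ret = setns(pidfd, CLONE_NEWNET) < 0 ? -errno : 0;
    close(pidfd);
    return ret;
}
```

A caller would use something like
join_netns_via_nsfd("/proc/1234/ns/net") or join_netns_via_pidfd(1234),
where 1234 is a placeholder PID. Both need sufficient privilege to
succeed.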

And then:

> If 0 is specified
> together with a pidfd then setns() will interpret it the same way 0 is
> interpreted together with a nsfd argument, i.e. attach to any/all
> namespaces.

If I understand right, setns(pidfd, 0) would mean: move into
all of the same namespaces as the process referred to by 'pidfd'.

But setns(nsfd, 0) means: move into whatever kind of namespace
is referred to by 'nsfd'.

I would not say of these two cases that 0 is interpreted
in the same way.

Hopefully I have not misunderstood.
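
One way to see what 0 means in the nsfd case: the kind of namespace
that setns(nsfd, 0) would join is a property of the file descriptor
itself, and it can be queried with the NS_GET_NSTYPE ioctl (Linux 4.11
and later). The sketch below is illustrative only; nsfd_type() is a
made-up helper name. In the pidfd case there is no such single type to
query, since under the proposal 0 would select every namespace of the
target process.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/nsfs.h> /* NS_GET_NSTYPE */
#include <sched.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Return the CLONE_NEW* constant describing the namespace that the
 * file at 'path' refers to, or a negative errno value on failure.
 * This is the namespace type that setns(nsfd, 0) would join.
 */
static int nsfd_type(const char *path)
{
    int fd, type, err;

    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -errno;

    type = ioctl(fd, NS_GET_NSTYPE);
    err = errno; /* save before close() can overwrite it */
    close(fd);

    return type < 0 ? -err : type;
}
```

For example, nsfd_type("/proc/self/ns/uts") should yield CLONE_NEWUTS
on a kernel that supports the ioctl.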



> The obvious example where this is useful is a standard container
> manager interacting with a running container: pushing and pulling files
> or directories, injecting mounts, attaching/execing any kind of process,
> managing network devices all these operations require attaching to all
> or at least multiple namespaces at the same time. Given that nowadays
> most containers are spawned with all namespaces enabled we're currently
> looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
> nsfds, another 7 to actually perform the namespace switch. With time
> namespaces we're looking at about 16 syscalls.
> (We could amortize the first 7 or 8 syscalls for opening the nsfds by
>  stashing them in each container's monitor process but that would mean
>  we need to send around those file descriptors through unix sockets
>  everytime we want to interact with the container or keep on-disk
>  state. Even in scenarios where a caller wants to join a particular
>  namespace in a particular order callers still profit from batching
>  other namespaces. That mostly applies to the user namespace but
>  all container runtimes I found join the user namespace first no matter
>  if it privileges or deprivileges the container.)
> With pidfds this becomes a single syscall no matter how many namespaces
> are supposed to be attached to.

That does seem like a win. Thanks for working on this!
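
For comparison, a rough sketch of the two approaches in C. The helper
names are hypothetical, the namespace list covers the 7 namespaces
mentioned above (no time namespace), and the pidfd variant assumes a
kernel with the proposed behavior and an already-open pidfd.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Entries under /proc/<pid>/ns/, with the user namespace first. */
static const char *const ns_names[] = {
    "user", "mnt", "pid", "net", "ipc", "uts", "cgroup",
};
#define NS_COUNT (sizeof(ns_names) / sizeof(ns_names[0]))

/*
 * nsfd approach: 7 open() calls followed by 7 setns() calls.
 * All descriptors are opened before the first switch, because joining
 * the mount namespace may change what /proc/<pid> resolves to.
 * If a setns() call fails part way through, the caller is left in a
 * mix of old and new namespaces.
 * Returns 0 on success or a negative errno value.
 */
static int attach_all_via_nsfds(pid_t pid)
{
    int fds[NS_COUNT];
    char path[64];
    size_t i, opened = 0;
    int ret = 0;

    for (i = 0; i < NS_COUNT; i++) {
        snprintf(path, sizeof(path), "/proc/%ld/ns/%s",
                 (long)pid, ns_names[i]);
        fds[i] = open(path, O_RDONLY | O_CLOEXEC);
        if (fds[i] < 0) {
            ret = -errno;
            break;
        }
        opened++;
    }

    for (i = 0; ret == 0 && i < opened; i++) {
        if (setns(fds[i], 0) < 0)
            ret = -errno;
    }

    for (i = 0; i < opened; i++)
        close(fds[i]);

    return ret;
}

/*
 * pidfd approach: one setns() call with all flags ORed together,
 * which attaches to all of the listed namespaces or to none of them.
 * Returns 0 on success or a negative errno value.
 */
static int attach_all_via_pidfd(int pidfd)
{
    int flags = CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID |
                CLONE_NEWNET | CLONE_NEWIPC | CLONE_NEWUTS |
                CLONE_NEWCGROUP;

    return setns(pidfd, flags) < 0 ? -errno : 0;
}
```

Besides the syscall count, the second form avoids the partial-failure
state that the first one can end up in.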

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/