lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Sat, 02 May 2020 07:35:53 -0500
From: (Eric W. Biederman)
To:     Christian Brauner <>
Cc:     Jann Horn <>,,
        Linux Containers <>,,
        Alexander Viro <>,
        Michael Kerrisk <>
Subject: Re: [PATCH v2 1/4] capability: add ns_capable_cred()

Christian Brauner <> writes:

> On Thu, Apr 30, 2020 at 01:09:30PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <> writes:
>> > Add a simple capability helper which makes it possible to determine
>> > whether a set of creds is ns capable wrt to the passed in credentials.
>> > This is not something exciting it's just a more pleasant wrapper around
>> > security_capable() by allowing ns_capable_common() to ake a const struct
>> > cred argument. In ptrace_has_cap() for example, we're using
>> > security_capable() directly. ns_capable_cred() will be used in the next
>> > patch to check against the target credentials the caller is going to
>> > switch to.
>> Given that this is to suppot setns.  I don't understand the
>> justification for this.
>> Is it your intention to use the reduced permissions that you get
>> when you install a user namespace?
> Indeed.
>> Why do you want to use the reduced permissions when installing multiple
>> namespaces at once?
> The intention is to use the target credentials we are going to install
> when setns() hits the point of no return. The target permissions are
> either the permissions of the caller or the reduced permissions if
> CLONE_NEWUSER has been requested. This has multiple reasons.
> The most obvious reason imho is that all other namespaces have an owning
> user namespace. Attaching to any other namespace requires the attacher
> to be privileged over the owning user namespace of that namespace.
> Consequently, all our current install handlers for every namespace we
> have check whether we are privileged in the owning user namespace of
> that user namespace. So in order to attach to any of those namespaces -
> especially when attaching as an unprivileged user - requires that we are
> attached to the user namespace first.

No actually it doesn't.  As if you have privileges to attach to the user
namespace you have the privileges to attach to anything it owns.  Or you
should I think I am missing some subtle detail at the moment.

> (That's especially useful given
> that most users especially container runtimes will unshare all
> namespaces. Doing it this way we can not just have attach privileged
> users attach to their containers but also unprivileged users to their
> containers in one shot.)

That is a wonderful reason for doing things, and it is the reason
why I am asking about it because I think you have it backwards.

Especially in the context of some container runtimes like Kubernetes
that I have been told will do things like share a network namespace
across all containers in a POD.

> A few other points about this. If one looks at clone(CLONE_NEW*) or
> unshare(CLONE_NEW*) then the ordering and permissions checking is the
> same way. All permissions checks are performed against the reduced
> permissions, i.e. if CLONE_NEWUSER is specified you check privilege
> against the reduced permissions too otherwise you wouldn't be able to
> spawn into a complete set of new namespaces as an unprivileged user.

That is a good catch and definitely a reason for looking at doing
things in this order.

For unshare and clone putting things in a user namespace means you can
create namespaces you could not create otherwise.

> This logic is also expressed in how setns() is already used in
> userspace. Any container runtime will attach to the user namespace first,
> so all subsequent calls to attach to other namespaces perform the checks
> against the reduced permissions. It also has to be that way because of
> fully unprivileged containers.

So I sat and looked.  For nsetner it winds up trying to enter
the namespaces in either order.

         * Now that we know which namespaces we want to enter, enter
         * them.  Do this in two passes, not entering the user
         * namespace on the first pass.  So if we're deprivileging the
         * container we'll enter the user namespace last and if we're
         * privileging it then we enter the user namespace first
         * (because the initial setns will fail).
        for (pass = 0; pass < 2; pass ++) {
                for (nsfile = namespace_files + 1 - pass; nsfile->nstype; nsfile++) {
                        if (nsfile->fd < 0)
                        if (nsfile->nstype == CLONE_NEWPID && do_fork == -1)
                                do_fork = 1;
                        if (setns(nsfile->fd, nsfile->nstype)) {
                                if (pass != 0)
                                            _("reassociate to namespace '%s' failed"),

                        nsfile->fd = -1;

Looking a little close we have at least for entering the mntns the
following checks:

	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
		return -EPERM;

Which require CAP_SYS_CHROOT and CAP_SYS_ADMIN in the current user namespace.

So there is defintiely an issue there.

I suspect the clean approach is to simply require CAP_SYS_CHROOT in
the new user namespace, when we are setting several of them at once.

> To put it another way, if we were to always perform the permission checks
> against the current permissions (i.e. no matter if CLONE_NEWUSER is
> specified or not) then we'd never be able to attach to a set of
> namespaces at once as an unprivileged user.
> We also really want to be able to express both semantics:
> 1. setns(flags & ~CLONE_NEWUSER) --> attach to all namespaces with my
>    current permission level
> 2. setns(flags | CLONE_NEWUSER) attach to all namespaces with the target
>    permission level
> It feels weird if both 1 and 2 would mean the exact same thing given
> that the user namespace has an owernship relation with all the other
> namespaces.

It feels weird to me to disallow anything that we have permissions for.

Can you dig up the actual install permissions checks and see if there is
anything other than the mount namespace that needs permissions in the
current user namespace?

Please let's walk this through.  I think there should be a way to
carefully phrase the permission checks that we don't need
ns_capable_cred that will allow goofy cases like setns into Kuberneties
PODs that share network namespaces.

I believe that will be a way to phrase the permission checks so that
with or without CLONE_NEWUSER they make sense, and give very similar

Certainly attaching to a mount namespace is going to need either being
root or attaching to a user namespace at the same time.  Because
attaching to a mount namespace needs functionality that the user
namespace provides.


Powered by blists - more mailing lists