linux-kernel - Re: [PATCH 0/3] Introduce user namespace capabilities

Message-ID: <799f3963-1f24-47a1-9e19-8d0ad3a49e45@schaufler-ca.com>
Date: Sun, 19 May 2024 10:03:29 -0700
From: Casey Schaufler <casey@...aufler-ca.com>
To: Serge Hallyn <serge@...lyn.com>
Cc: Jonathan Calmels <jcalmels@...0.net>, Jarkko Sakkinen
 <jarkko@...nel.org>, brauner@...nel.org, ebiederm@...ssion.com,
 Luis Chamberlain <mcgrof@...nel.org>, Kees Cook <keescook@...omium.org>,
 Joel Granados <j.granados@...sung.com>, Paul Moore <paul@...l-moore.com>,
 James Morris <jmorris@...ei.org>, David Howells <dhowells@...hat.com>,
 containers@...ts.linux.dev, linux-kernel@...r.kernel.org,
 linux-fsdevel@...r.kernel.org, linux-security-module@...r.kernel.org,
 keyrings@...r.kernel.org, Casey Schaufler <casey@...aufler-ca.com>
Subject: Re: [PATCH 0/3] Introduce user namespace capabilities

On 5/18/2024 5:20 AM, Serge Hallyn wrote:
> On Fri, May 17, 2024 at 10:53:24AM -0700, Casey Schaufler wrote:
>> On 5/17/2024 4:42 AM, Jonathan Calmels wrote:
>>>>>> On Thu May 16, 2024 at 10:07 PM EEST, Casey Schaufler wrote:
>>>>>>> I suggest that adding a capability set for user namespaces is a bad idea:
>>>>>>> 	- It is in no way obvious what problem it solves
>>>>>>> 	- It is not obvious how it solves any problem
>>>>>>> 	- The capability mechanism has not been popular, and relying on a
>>>>>>> 	  community (e.g. container developers) to embrace it based on this
>>>>>>> 	  enhancement is a recipe for failure
>>>>>>> 	- Capabilities are already more complicated than modern developers
>>>>>>> 	  want to deal with. Adding another, special purpose set, is going
>>>>>>> 	  to make them even more difficult to use.
>>> Sorry if the commit wasn't clear enough.
>> While, as others have pointed out, the commit description left
>> much to be desired, that isn't the biggest problem with the change
>> you're proposing.
>>
>>>  Basically:
>>>
>>> - Today user namespaces grant full capabilities.
>> Of course they do. I have been following the use of capabilities
>> in Linux since before they were implemented. The uptake has been
>> disappointing in all use cases.
>>
>>>   This behavior is often abused to attack various kernel subsystems.
>> Yes. The problems of a single, all powerful root privilege scheme are
>> well documented.
>>
>>>   Only option
>> Hardly.
>>
>>>  is to disable them altogether which breaks a lot of
>>>   userspace stuff.
>> Updating userspace components to behave properly in a capabilities
>> environment has never been a popular activity, but is the right way
>> to address this issue. And before you start on the "no one can do that,
>> it's too hard", I'll point out that multiple UNIX systems supported
>> rootless, all capabilities based systems back in the day. 
>>
>>>   This goes against the least privilege principle.
>> If you're going to run userspace that *requires* privilege, you have
>> to have a way to *allow* privilege. If the userspace insists on a root
>> based privilege model, you're stuck supporting it. Regardless of your
>> principles.
> Casey,
>
> I might be wrong, but I think you're misreading this patchset.  It is not
> about limiting capabilities in the init user ns at all.  It's about limiting
> the capabilities which a process in a child userns can get.

I do understand that. My objection is not to the intent, but to the approach.
Adding a capability set to the general mechanism in support of a limited, specific
use case seems wrong to me. I would rather see a mechanism in userns to limit
the capabilities in a user namespace than a mechanism in capabilities that is
specific to user namespaces.

> Any unprivileged task can create a new userns, and get a process with
> all capabilities in that namespace.  Always.  User namespaces were a
> great success in that we can do this without any resulting privilege
> against host owned resources.  The unaddressed issue is the expanded
> kernel code surface area.

An option to clone() then, to limit the capabilities available?
I honestly can't recall if that has been suggested elsewhere, and
apologize if it's already been dismissed as a stoopid idea.

>
> You say, above, (quoting out of place here)
>
>> Updating userspace components to behave properly in a capabilities
>> environment has never been a popular activity, but is the right way
>> to address this issue. And before you start on the "no one can do that,
>> it's too hard", I'll point out that multiple UNIX systems supported
> He's not saying no one can do that.  He's saying, correctly, that the
> kernel currently offers no way for userspace to do this limiting.  His
> patchset offers two ways: one system wide capability mask (which applies
> only to non-initial user namespaces) and one per-process inherited one
> which - yay - userspace can use to limit what its children will be
> able to get if they unshare a user namespace.
>
>>> - It adds a new capability set.
>> Which is a really, really bad idea. The equation for calculating effective
>> privilege is already more complicated than userspace developers are generally
>> willing to put up with.
> This is somewhat true, but I think the semantics of what is proposed here are
> about as straightforward as you could hope for, and you can basically reason
> about them completely independently of the other sets.  Only when reasoning
> about the correctness of this code do you need to consider the other sets.  Not
> when administering a system.
>
> If you want root in a child user namespace to not have CAP_MAC_ADMIN, you drop
> it from your pU.  Simple as that.
>
>>>   This set dictates what capabilities are granted in namespaces (instead
>>>   of always getting full caps).
>> I would not expect container developers to be eager to learn how to use
>> this facility.
> I'm a container developer, and I'm excited about it :)

OK, well, I'm wrong. It's happened before and will happen again.

>
>>>   This brings namespaces in line with the rest of the system, user
>>>   namespaces are no more "special".
>> I'm sorry, but this makes no sense to me whatsoever. You want to introduce
>> a capability set explicitly for namespaces in order to make them less
>> special?
> Yes, exactly.

Hmm. I can't say I buy that. It makes a whole lot more sense to me to
change userns than to change capabilities.

>
>> Maybe I'm just old and cranky.
> That's fine.
>
>>>   They now work the same way as say a transition to root does with
>>>   inheritable caps.
>> That needs some explanation.
>>
>>> - This isn't intended to be used by end users per se (although they could).
>>>   This would be used at the same places where existing capabilities are
>>>   used today (e.g. init system, pam, container runtime, browser
>>>   sandbox), or by system administrators.
>> I understand that. It is for containers. Containers are not kernel entities.
> User namespaces are.
>
> This patch set provides userspace a way of limiting the kernel code exposed
> to untrusted children, which currently does not exist.

Yes, I understand. I would rather see a change to userns in support of a userns
specific need than a change to capabilities for a userns specific need.

>>> To give you some ideas of things you could do:
>>>
>>> # E.g. prevent alice from getting CAP_NET_ADMIN in user namespaces under SSH
>>> echo "auth optional pam_cap.so" >> /etc/pam.d/sshd
>>> echo "!cap_net_admin alice" >> /etc/security/capability.conf
>>>
>>> # E.g. prevent any Docker container from ever getting CAP_DAC_OVERRIDE
>>> systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \
>>>             -p SecureBits=userns-strict-caps \
>>>             /usr/bin/dockerd
>>>
>>> # E.g. kernel could be vulnerable to CAP_SYS_RAWIO exploits
>>> # Prevent users from ever gaining it
>>> sysctl -w cap_bound_userns_mask=0x1fffffdffff
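
For reference, the mask value in the last quoted example can be derived by
clearing one bit from an all-ones capability mask. This is only a sketch: it
assumes 41 defined capabilities (CAP_LAST_CAP = 40), which matches the width
of the quoted value, and takes CAP_SYS_RAWIO = 17 from
include/uapi/linux/capability.h. The variable names are illustrative and do
not come from the patchset.

```shell
# Capability number of CAP_SYS_RAWIO in include/uapi/linux/capability.h.
cap_sys_rawio=17

# Assumption: 41 capabilities are defined (numbers 0 through 40).
ncaps=41

# All-ones mask covering every defined capability.
full=$(( (1 << ncaps) - 1 ))

# Clear the CAP_SYS_RAWIO bit and format the result as hex.
mask=$(( full & ~(1 << cap_sys_rawio) ))
mask_hex=$(printf '0x%x' "$mask")

echo "$mask_hex"
```

The printed value could then be passed to the cap_bound_userns_mask sysctl
proposed in the patchset, as in the quoted example.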