netdev - Re: [PATCH] net: sysctl: fix edge case wrt. sysctl write access

Message-ID: <87msuc3rxb.fsf@email.froward.int.ebiederm.org>
Date: Fri, 15 Dec 2023 00:23:28 -0600
From: "Eric W. Biederman" <ebiederm@...ssion.com>
To: Maciej Żenczykowski <maze@...gle.com>
Cc: Paolo Abeni <pabeni@...hat.com>,  Linux Network Development Mailing List
 <netdev@...r.kernel.org>,  "David S . Miller" <davem@...emloft.net>,  Eric
 Dumazet <edumazet@...gle.com>,  Jakub Kicinski <kuba@...nel.org>,  Flavio
 Crisciani <fcrisciani@...gle.com>,  "Theodore Y. Ts'o" <tytso@...gle.com>,
  Dmitry Torokhov <dmitry.torokhov@...il.com>
Subject: Re: [PATCH] net: sysctl: fix edge case wrt. sysctl write access

Maciej Żenczykowski <maze@...gle.com> writes:

> On Thu, Dec 14, 2023 at 10:37 AM Paolo Abeni <pabeni@...hat.com> wrote:
>>
>> On Sun, 2023-12-10 at 03:10 -0800, Maciej Żenczykowski wrote:
>> > The clear intent of net_ctl_permissions() is that having CAP_NET_ADMIN
>> > grants write access to networking sysctls.
>> >
>> > However, it turns out there is an edge case where this is insufficient:
>> > inode_permission() has an additional check on HAS_UNMAPPED_ID(inode)
>> > which can return -EACCES and thus block *all* write access.
>> >
>> > Note: AFAICT this check is wrt. the uid/gid mapping that was
>> > active at the time the filesystem (ie. proc) was mounted.
>> >
>> > In order for this check to not fail, we need net_ctl_set_ownership()
>> > to set valid uid/gid.  It is not immediately clear what value
>> > to use, nor what values are guaranteed to work.
>> > It does make sense that /proc/sys/net appear to be owned by root
>> > from within the netns owning userns.  As such we only modify
>> > what happens if the code fails to map uid/gid 0.
>> > Currently the code just fails to do anything, which in practice
>> > results in using the zeroes of freshly allocated memory,
>> > and we thus end up with global root.
>> > With this change we instead use the uid/gid of the owning userns.
>> > While it is probably (?) theoretically possible for this to *also*
>> > be unmapped from the /proc filesystem's point of view, this seems
>> > much less likely to happen in practice.
>> >
>> > The old code is observed to fail in a relatively complex setup,
>> > within a global root created user namespace with selectively
>> > mapped uid/gids (not including global root) and /proc mounted
>> > afterwards (so this /proc mount does not have global root mapped).
>> > Within this user namespace another non privileged task creates
>> > a new user namespace, maps its own uid/gid (but not uid/gid 0),
>> > and then creates a network namespace.  It cannot write to networking
>> > sysctls even though it does have CAP_NET_ADMIN.
>>
>> I'm wondering if this specific scenario should be considered a setup
>> issue, and should be solved with a different configuration? I would
>> love to hear others' opinions!

It feels like a setup issue to me.  A different mount of proc can
always be used to set the sysctls if really needed.

>
> While it could be fixed in userspace, I don't think it should be:
>
> The global root uid/gid are very intentionally not mapped in (as a
> security feature).
> So that part isn't changeable (it's also a system daemon and not under
> user control).

Likewise the default for all sysctls is global uid 0 and global gid 0.

> The user namespace very intentionally maps uid->uid and not 0->uid.
> Here there's theoretically more leeway... because it is at least under
> user control.
> However here this is done for good reason as well.
> There's plenty of code that special cases uid=0, both in the kernel
> (for example capability handling across exec) and in various userspace
> libraries.  It's unrealistic to fix them all.

Ish.  Frankly in the kernel I have fixed them all a long time ago.
At least when the uids don't map straight through.

At least in the case when they don't map to the global uid 0 and global
gid 0.

> Additionally it's nice to have semi-transparent user namespaces,
> which are security barriers but don't remap uids - remapping causes confusion.
> (ie. the uid is either mapped or not, but if it is mapped it's a 1:1 mapping)

Ah.  So you are deliberately creating this setup.

> As for why?  Because uids as visible to userspace may leak across user
> namespace boundaries,
> either when talking to other system daemons or when talking across machines.
> It's pretty easy (and common) to have uids that are globally unique
> and meaningful in a cluster of machines.
> Again, this is *theoretically* fixable in userspace, but not actually
> a realistic expectation.
>
> btw. even outside of clusters of machines, I also run some
> user/uts/net namespace using
> code on my personal desktop (this does require some minor hacks to
> unshare/mount binaries),
> and again I intentionally map uid->uid and not 0->uid, because this makes
> my username show up as 'maze' and not 'root'.
>
> This is *clearly* a kernel bug that this doesn't just work.

No, it is not *clearly* a bug.

If the owning uid/gid is not mapped it is in general not safe to write
to an inode.  It is a nonsense scenario.
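
A minimal sketch of the check under discussion, as a toy model rather
than the kernel's code.  It assumes an id map is a list of
(inside, outside, count) extents expressed against global ids (real
uid_map files are relative to the parent namespace), and that a write
is refused with -EACCES whenever the inode's owning uid or gid has no
mapping.  The names from_global, write_check and PROC_MOUNT_MAP are
hypothetical.

```python
import errno
from typing import List, Optional, Tuple

# One mapping extent: (first id inside the namespace, first global id, count).
# Simplification: real uid_map files are relative to the parent namespace.
Extent = Tuple[int, int, int]


def from_global(id_map: List[Extent], global_id: int) -> Optional[int]:
    """Return the id as seen inside the namespace, or None if unmapped."""
    for inside, outside, count in id_map:
        if outside <= global_id < outside + count:
            return inside + (global_id - outside)
    return None


def write_check(id_map: List[Extent], owner_uid: int, owner_gid: int) -> int:
    """Toy stand-in for the unmapped-owner test: 0 if allowed, else -EACCES."""
    if from_global(id_map, owner_uid) is None:
        return -errno.EACCES
    if from_global(id_map, owner_gid) is None:
        return -errno.EACCES
    return 0


# Hypothetical map for the namespace in which /proc was mounted:
# ids 1000..1999 map 1:1, and global root (0) is deliberately left out.
PROC_MOUNT_MAP: List[Extent] = [(1000, 1000, 1000)]

# Sysctl inode left owned by global root: the owner is unmapped.
denied = write_check(PROC_MOUNT_MAP, 0, 0)
# Sysctl inode owned by a mapped id (here 1000): the check passes.
allowed = write_check(PROC_MOUNT_MAP, 1000, 1000)
```

In this model the first call corresponds to the reported failure
(owner falls back to global root, which the mount cannot see) and the
second to an owner that is mapped, which is what the proposed patch
tries to arrange.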

You have deliberately created a scenario where there is no uid 0 or
gid 0 to deliberately break things as a security feature and then you
are asking why things break?

As far as I can tell this is like locking your keys in the car, to make
certain no one can steal your car.  It works but it also makes it
difficult to get into your car and drive it.

> (side note: there's a very similar issue in proc_net.c which I haven't
> gotten around to fixing yet, because it looks to be more complex to
> convince oneself it's safe to do)

It is not so much similar as the same.

Among other things there is a possibility that someone else has
deliberately relied on the inability to write to those sysctls when
uid 0 and gid 0 are not mapped.  If this is the case you are busy
breaking someone else's security.

As I recall the classic approach on nfs is to map uid 0 to a useless uid
like nobody.  I don't understand why something like that is not being
used in your case.
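
A hedged sketch of that idea, using a toy model of an id map rather
than any kernel interface.  It assumes extents of the form
(inside, outside, count), that 65534 is the "nobody" id (conventional
on many Linux systems, but an assumption here), and a hypothetical
user id of 1000.  The helper to_global is likewise hypothetical.

```python
from typing import List, Optional, Tuple

# One mapping extent: (first id inside the namespace, first outside id, count).
Extent = Tuple[int, int, int]


def to_global(id_map: List[Extent], ns_id: int) -> Optional[int]:
    """Return the outside id for an id inside the namespace, or None."""
    for inside, outside, count in id_map:
        if inside <= ns_id < inside + count:
            return outside + (ns_id - inside)
    return None


NOBODY = 65534  # assumed "nobody" id; check the target system
USER = 1000     # hypothetical unprivileged user id

# The setup from the thread: only the user's own id is mapped, 1:1.
ONLY_SELF: List[Extent] = [(USER, USER, 1)]

# The squash-style alternative: uid 0 inside maps to a useless outside id.
WITH_SQUASHED_ROOT: List[Extent] = [(0, NOBODY, 1), (USER, USER, 1)]

root_without_squash = to_global(ONLY_SELF, 0)
root_with_squash = to_global(WITH_SQUASHED_ROOT, 0)
user_with_squash = to_global(WITH_SQUASHED_ROOT, USER)
```

With the second map, uid 0 inside the namespace resolves to a valid
but powerless outside id, while the user's own id still maps 1:1.
Whether that outside id is itself visible from the /proc mount is a
separate question that this sketch does not model.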

I don't see how deliberately crippling yourself by leaving 0 completely
unmapped gains anything of any value.  As such I don't understand
why you would like the kernel to have a special case to support your
use case.

Eric