netdev - Re: [PATCH] net: sysctl: fix edge case wrt. sysctl write access

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANP3RGfk6PqR2P8HnGX92ODnf6V5iKb+_zjonOsTDOB-3odM5g@mail.gmail.com>
Date: Thu, 14 Dec 2023 14:17:44 +0100
From: Maciej Żenczykowski <maze@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Linux Network Development Mailing List <netdev@...r.kernel.org>, "David S . Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, 
	Flavio Crisciani <fcrisciani@...gle.com>, "Theodore Y. Ts'o" <tytso@...gle.com>, 
	"Eric W. Biederman" <ebiederm@...ssion.com>, Dmitry Torokhov <dmitry.torokhov@...il.com>
Subject: Re: [PATCH] net: sysctl: fix edge case wrt. sysctl write access

On Thu, Dec 14, 2023 at 10:37 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On Sun, 2023-12-10 at 03:10 -0800, Maciej Żenczykowski wrote:
> > The clear intent of net_ctl_permissions() is that having CAP_NET_ADMIN
> > grants write access to networking sysctls.
> >
> > However, it turns out there is an edge case where this is insufficient:
> > inode_permission() has an additional check on HAS_UNMAPPED_ID(inode)
> > which can return -EACCES and thus block *all* write access.
> >
> > Note: AFAICT this check is wrt. the uid/gid mapping that was
> > active at the time the filesystem (ie. proc) was mounted.
> >
> > In order for this check to not fail, we need net_ctl_set_ownership()
> > to set valid uid/gid.  It is not immediately clear what value
> > to use, nor what values are guaranteed to work.
> > It does make sense that /proc/sys/net appear to be owned by root
> > from within the netns owning userns.  As such we only modify
> > what happens if the code fails to map uid/gid 0.
> > Currently the code just fails to do anything, which in practice
> > results in using the zeroes of freshly allocated memory,
> > and we thus end up with global root.
> > With this change we instead use the uid/gid of the owning userns.
> > While it is probably (?) theoretically possible for this to *also*
> > be unmapped from the /proc filesystem's point of view, this seems
> > much less likely to happen in practice.
> >
> > The old code is observed to fail in a relatively complex setup,
> > within a global root created user namespace with selectively
> > mapped uid/gids (not including global root) and /proc mounted
> > afterwards (so this /proc mount does not have global root mapped).
> > Within this user namespace another non privileged task creates
> > a new user namespace, maps it's own uid/gid (but not uid/gid 0),
> > and then creates a network namespace.  It cannot write to networking
> > sysctls even though it does have CAP_NET_ADMIN.
>
> I'm wondering if this specific scenario should be considered a setup
> issue, and should be solved with a different configuration? I would
> love to hear others opinions!

While it could be fixed in userspace.  I don't think it should:

The global root uid/gid are very intentionally not mapped in (as a
security feature).
So that part isn't changeable (it's also a system daemon and not under
user control).

The user namespace very intentionally maps uid->uid and not 0->uid.
Here there's theoretically more leeway... because it is at least under
user control.
However here this is done for good reason as well.
There's plenty of code that special cases uid=0, both in the kernel
(for example capability handling across exec) and in various userspace
libraries.  It's unrealistic to fix them all.
Additionally it's nice to have semi-transparent user namespaces,
which are security barriers but don't remap uids - remapping causes confusion.
(ie. the uid is either mapped or not, but if it is mapped it's a 1:1 mapping)

As for why?  Because uids as visible to userspace may leak across user
namespace boundaries,
either when talking to other system daemons or when talking across machines.
It's pretty easy (and common) to have uids that are globally unique
and meaningful in a cluster of machines.
Again, this is *theoretically* fixable in userspace, but not actually
a realistic expectation.

btw. even outside of clusters of machines, I also run some
user/uts/net namespace using
code on my personal desktop (this does require some minor hacks to
unshare/mount binaries),
and again I intentionally map uid->uid and 0->uid, because this makes
my username show up as 'maze' and not 'root'.

This is *clearly* a kernel bug that this doesn't just work.
(side note: there's a very similar issue in proc_net.c which I haven't
gotten around to fixing yet, because it looks to be more complex to
convince oneself it's safe to do)

> > This is because net_ctl_set_ownership fails to map uid/gid 0
> > (because uid/gid 0 are *not* mapped in the owning 2nd level user_ns),
> > and falls back to global root.
> > But global root is not mapped in the 1st level user_ns,
> > which was inherited by the /proc mount, and thus fails...
> >
> > Note: the uid/gid of networking sysctls is of purely superficial
> > importance, outside of this UNMAPPED check, it does not actually
> > affect access, and only affects display.
> >
> > Access is always based on whether you are *global* root uid
> > (or have CAP_NET_ADMIN over the netns) for user write access bits
> > (or are in *global* root gid for group write access bits).
> >
> > Cc: Flavio Crisciani <fcrisciani@...gle.com>
> > Cc: "Theodore Y. Ts'o" <tytso@...gle.com>
> > Cc: "Eric W. Biederman" <ebiederm@...ssion.com>
> > Cc: Dmitry Torokhov <dmitry.torokhov@...il.com>
> > Fixes: e79c6a4fc923 ("net: make net namespace sysctls belong to container's owner")
> > Signed-off-by: Maciej Żenczykowski <maze@...gle.com>
> > ---
> >  net/sysctl_net.c | 13 ++++---------
> >  1 file changed, 4 insertions(+), 9 deletions(-)
> >
> > diff --git a/net/sysctl_net.c b/net/sysctl_net.c
> > index 051ed5f6fc93..ded399f380d9 100644
> > --- a/net/sysctl_net.c
> > +++ b/net/sysctl_net.c
> > @@ -58,16 +58,11 @@ static void net_ctl_set_ownership(struct ctl_table_header *head,
> >                                 kuid_t *uid, kgid_t *gid)
> >  {
> >       struct net *net = container_of(head->set, struct net, sysctls);
> > -     kuid_t ns_root_uid;
> > -     kgid_t ns_root_gid;
> > +     kuid_t ns_root_uid = make_kuid(net->user_ns, 0);
> > +     kgid_t ns_root_gid = make_kgid(net->user_ns, 0);
> >
> > -     ns_root_uid = make_kuid(net->user_ns, 0);
>
> As a fix I would prefer you would keep it minimal. e.g. just replace
> the if with the ternary operator or just add an 'else' branch.

If you think that's better, I'll resend a v2.

>
> Cheers,
>
> Paolo
>

--
Maciej Żenczykowski, Kernel Networking Developer @ Google