[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANP3RGf3_TGz-DCYm3d3+3XaM5w9GL+KWLTOf5YtPRnqcWjLXg@mail.gmail.com>
Date: Thu, 14 Dec 2023 18:52:24 +0100
From: Maciej Żenczykowski <maze@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Linux Network Development Mailing List <netdev@...r.kernel.org>, "David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
Flavio Crisciani <fcrisciani@...gle.com>, "Theodore Y. Ts'o" <tytso@...gle.com>,
"Eric W. Biederman" <ebiederm@...ssion.com>, Dmitry Torokhov <dmitry.torokhov@...il.com>
Subject: Re: [PATCH] net: sysctl: fix edge case wrt. sysctl write access
On Thu, Dec 14, 2023 at 6:16 PM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On Thu, 2023-12-14 at 14:17 +0100, Maciej Żenczykowski wrote:
> > On Thu, Dec 14, 2023 at 10:37 AM Paolo Abeni <pabeni@...hat.com> wrote:
> > >
> > > On Sun, 2023-12-10 at 03:10 -0800, Maciej Żenczykowski wrote:
> > > > The clear intent of net_ctl_permissions() is that having CAP_NET_ADMIN
> > > > grants write access to networking sysctls.
> > > >
> > > > However, it turns out there is an edge case where this is insufficient:
> > > > inode_permission() has an additional check on HAS_UNMAPPED_ID(inode)
> > > > which can return -EACCES and thus block *all* write access.
> > > >
> > > > Note: AFAICT this check is wrt. the uid/gid mapping that was
> > > > active at the time the filesystem (ie. proc) was mounted.
> > > >
> > > > In order for this check to not fail, we need net_ctl_set_ownership()
> > > > to set valid uid/gid. It is not immediately clear what value
> > > > to use, nor what values are guaranteed to work.
> > > > It does make sense that /proc/sys/net appear to be owned by root
> > > > from within the netns owning userns. As such we only modify
> > > > what happens if the code fails to map uid/gid 0.
> > > > Currently the code just fails to do anything, which in practice
> > > > results in using the zeroes of freshly allocated memory,
> > > > and we thus end up with global root.
> > > > With this change we instead use the uid/gid of the owning userns.
> > > > While it is probably (?) theoretically possible for this to *also*
> > > > be unmapped from the /proc filesystem's point of view, this seems
> > > > much less likely to happen in practice.
> > > >
> > > > The old code is observed to fail in a relatively complex setup,
> > > > within a global root created user namespace with selectively
> > > > mapped uid/gids (not including global root) and /proc mounted
> > > > afterwards (so this /proc mount does not have global root mapped).
> > > > Within this user namespace another non privileged task creates
> > > > a new user namespace, maps it's own uid/gid (but not uid/gid 0),
> > > > and then creates a network namespace. It cannot write to networking
> > > > sysctls even though it does have CAP_NET_ADMIN.
> > >
> > > I'm wondering if this specific scenario should be considered a setup
> > > issue, and should be solved with a different configuration? I would
> > > love to hear others opinions!
> >
> > While it could be fixed in userspace. I don't think it should:
> >
> > The global root uid/gid are very intentionally not mapped in (as a
> > security feature).
> > So that part isn't changeable (it's also a system daemon and not under
> > user control).
> >
> > The user namespace very intentionally maps uid->uid and not 0->uid.
> > Here there's theoretically more leeway... because it is at least under
> > user control.
> > However here this is done for good reason as well.
> > There's plenty of code that special cases uid=0, both in the kernel
> > (for example capability handling across exec) and in various userspace
> > libraries. It's unrealistic to fix them all.
> > Additionally it's nice to have semi-transparent user namespaces,
> > which are security barriers but don't remap uids - remapping causes confusion.
> > (ie. the uid is either mapped or not, but if it is mapped it's a 1:1 mapping)
> >
> > As for why? Because uids as visible to userspace may leak across user
> > namespace boundaries,
> > either when talking to other system daemons or when talking across machines.
> > It's pretty easy (and common) to have uids that are globally unique
> > and meaningful in a cluster of machines.
> > Again, this is *theoretically* fixable in userspace, but not actually
> > a realistic expectation.
> >
> > btw. even outside of clusters of machines, I also run some
> > user/uts/net namespace using
> > code on my personal desktop (this does require some minor hacks to
> > unshare/mount binaries),
> > and again I intentionally map uid->uid and 0->uid, because this makes
obviously this was meant to say "and not 0->uid"
> > my username show up as 'maze' and not 'root'.
>
> I see, thanks for all the details.
>
> > This is *clearly* a kernel bug that this doesn't just work.
> > (side note: there's a very similar issue in proc_net.c which I haven't
> > gotten around to fixing yet, because it looks to be more complex to
> > convince oneself it's safe to do)
>
> Indeed the potential security related issue is the root cause of my
> concerns here. I could not identify any such problem, but I must admit
> the uid mapping is not the kernel I know better.
Oh, I'm not surprised: It took Flavio (as the affected user) and I
~two weeks or so (and dozens of back and forths with test binaries
with more and more logging and different approaches to doing things)
to figure out what was going wrong. Ultimately we only succeeded once
I managed to get the test case running on a semi-isolated dev machine
with the full userspace cluster stack and a printk instrumented
kernel. It was certainly a head-scratcher...
> I definitely could use another pair of eyeballs here ;)
>
> Cheers,
>
> Paolo
>
--
Maciej Żenczykowski, Kernel Networking Developer @ Google
Powered by blists - more mailing lists