Message-ID: <2025-07-23.1753314869-silly-creamer-crushed-cabana-proper-jury-FaB28g@cyphar.com>
Date: Thu, 24 Jul 2025 09:55:05 +1000
From: Aleksa Sarai <cyphar@...har.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, Jonathan Corbet <corbet@....net>,
Shuah Khan <shuah@...nel.org>, linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-api@...r.kernel.org, linux-doc@...r.kernel.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH RFC 0/4] procfs: make reference pidns more user-visible

On 2025-07-22, Aleksa Sarai <cyphar@...har.com> wrote:
> On 2025-07-21, Andy Lutomirski <luto@...capital.net> wrote:
> > On Mon, Jul 21, 2025 at 1:44 AM Aleksa Sarai <cyphar@...har.com> wrote:
> > >
> > > Ever since the introduction of pid namespaces, procfs has had very
> > > implicit behaviour surrounding them (the pidns used by a procfs mount is
> > > auto-selected based on the mounting process's active pidns, and the
> > > pidns itself is basically hidden once the mount has been constructed).
> > > This has historically meant that userspace was required to do some
> > > special dances in order to configure the pidns of a procfs mount as
> > > desired. Examples include:
> > >
> > > * In order to bypass the mnt_too_revealing() check, Kubernetes creates
> > >   a procfs mount from an empty pidns so that user namespaced containers
> > >   can be nested (without this, the nested containers would fail to
> > >   mount procfs). But this requires forking off a helper process because
> > >   you cannot just one-shot this using mount(2).
> > >
> > > * Container runtimes in general need to fork into a container before
> > >   configuring its mounts, which can lead to security issues in the case
> > >   of shared-pidns containers (a privileged process in the pidns can
> > >   interact with your container runtime process). While
> > >   SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
> > >   strict need for this due to a minor uAPI wart is kind of unfortunate.
> > >
> > > Things would be much easier if there was a way for userspace to just
> > > specify the pidns they want. Patch 1 implements a new "pidns" argument
> > > which can be set using fsconfig(2):
> > >
> > >     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> > >     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> > >
> > > or classic mount(2) / mount(8):
> > >
> > >     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
> > >     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> > >
> > > The initial security model I have in this RFC is to be as conservative
> > > as possible and just mirror the security model for setns(2) -- which
> > > means that you can only set pidns=... to pid namespaces that your
> > > current pid namespace is a direct ancestor of. This fulfils the
> > > requirements of container runtimes, but I suspect that this may be too
> > > strict for some usecases.
> > >
> > > The pidns argument is not displayed in mountinfo -- it's not clear to me
> > > what value it would make sense to show (maybe we could just use ns_dname
> > > to provide an identifier for the namespace, but this number would be
> > > fairly useless to userspace). I'm open to suggestions.
> > >
> > > In addition, being able to figure out what pid namespace is being used
> > > by a procfs mount is quite useful when you have an administrative
> > > process (such as a container runtime) which wants to figure out the
> > > correct way of mapping PIDs between its own namespace and the namespace
> > > for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> > > alternative ways to do this, but they all rely on ancillary information
> > > that third-party libraries and tools do not necessarily have access to.
> > >
> > > To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> > > can be used to get a reference to the pidns that a procfs is using.
> > >
> > > It's not quite clear what is the correct security model for this API,
> > > but the current approach I've taken is to:
> > >
> > > * Make the ioctl only valid on the root (meaning that a process without
> > >   access to the procfs root -- such as only having an fd to a procfs
> > >   file or some open_tree(2)-like subset -- cannot use this API).
> > >
> > > * Require that the process requesting either has access to
> > >   /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
> > >   pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
> > >   access to it and can join it if they had a handle), or is in a pidns
> > >   that is a direct ancestor of the target pidns (i.e. all of the pids
> > >   are already visible in the procfs for the current process's pidns).
> >
> > What's the motivation for the ptrace-read option? While I don't see
> > an attack off the top of my head, it seems like creating a procfs
> > mount may give write-ish access to things in the pidns (because the
> > creator is likely to have CAP_DAC_OVERRIDE, etc) and possibly even
> > access to namespace-wide things that aren't inherently visible to
> > PID1.
>
> This latter section is about the privilege model for
> ioctl(PROCFS_GET_PID_NAMESPACE), not the pidns= mount flag. pidns=
> requires CAP_SYS_ADMIN for pidns->user_ns, in addition to the same
> restrictions as pidns_install() (must be a direct ancestor). Maybe I
> should add some headers in this cover letter for v2...
>
> For the ioctl -- if the user can ptrace-read pid1 in the pidns, they can
> open a handle to /proc/1/ns/pid which is exactly the same thing they'd
> get from PROCFS_GET_PID_NAMESPACE.
>
> > Even the ancestor check seems dicey. Imagine that uid 1000 makes an
> > unprivileged container complete with a userns. Then uid 1001 (outside
> > the container) makes its own userns and mountns but stays in the init
> > pidns and then mounts (and owns, with all filesystem-related
> > capabilities) that mount. Is this really safe?
>
> As for the ancestor check (for the ioctl), the logic I had was that
> being in an ancestor pidns means that you already can see all of the
> subprocesses in your own pidns, so it seems strange to not be able to
> get a handle to their pidns. Maybe this isn't quite right, idk.
>
> Ultimately there isn't too much you can do with a pidns fd if you don't
> have privileges to join it (the only thing I can think of is that you
> could bind-mount it, which could maybe be used to trick an
> administrative process if they trusted your mountns for some reason).
>
> > CAP_SYS_ADMIN seems about right.
>
> For pidns=, sure. For the ioctl, I think this is overkill.

My bad, I forgot to add you to Cc for v2, Andy. PTAL:
<https://lore.kernel.org/all/20250723-procfs-pidns-api-v2-0-621e7edd8e40@cyphar.com/>
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/