Message-ID: <CALCETrVo+Mdj7as2R0R+FqTBbjqwTkXu5Zkj=dg8EVM9xRhBPw@mail.gmail.com>
Date: Mon, 21 Jul 2025 07:54:25 -0700
From: Andy Lutomirski <luto@...capital.net>
To: Aleksa Sarai <cyphar@...har.com>
Cc: Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
Jonathan Corbet <corbet@....net>, Shuah Khan <shuah@...nel.org>, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-api@...r.kernel.org,
linux-doc@...r.kernel.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH RFC 0/4] procfs: make reference pidns more user-visible

On Mon, Jul 21, 2025 at 1:44 AM Aleksa Sarai <cyphar@...har.com> wrote:
>
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> This has historically meant that userspace was required to do some
> special dances in order to configure the pidns of a procfs mount as
> desired. Examples include:
>
> * In order to bypass the mnt_too_revealing() check, Kubernetes creates
> a procfs mount from an empty pidns so that user namespaced containers
> can be nested (without this, the nested containers would fail to
> mount procfs). But this requires forking off a helper process because
> you cannot just one-shot this using mount(2).
>
> * Container runtimes in general need to fork into a container before
> configuring its mounts, which can lead to security issues in the case
> of shared-pidns containers (a privileged process in the pidns can
> interact with your container runtime process). While
> SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
> strict need for this due to a minor uAPI wart is kind of unfortunate.
>
> Things would be much easier if there was a way for userspace to just
> specify the pidns they want. Patch 1 implements a new "pidns" argument
> which can be set using fsconfig(2):
>
> fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
>
> or classic mount(2) / mount(8):
>
> // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
> mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
>
> The initial security model I have in this RFC is to be as conservative
> as possible and just mirror the security model for setns(2) -- which
> means that you can only set pidns=... to pid namespaces that your
> current pid namespace is a direct ancestor of. This fulfils the
> requirements of container runtimes, but I suspect that this may be too
> strict for some usecases.
>
> The pidns argument is not displayed in mountinfo -- it's not clear to me
> what value it would make sense to show (maybe we could just use ns_dname
> to provide an identifier for the namespace, but this number would be
> fairly useless to userspace). I'm open to suggestions.
>
> In addition, being able to figure out what pid namespace is being used
> by a procfs mount is quite useful when you have an administrative
> process (such as a container runtime) which wants to figure out the
> correct way of mapping PIDs between its own namespace and the namespace
> for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> alternative ways to do this, but they all rely on ancillary information
> that third-party libraries and tools do not necessarily have access to.
>
> To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> can be used to get a reference to the pidns that a procfs is using.
>
> It's not quite clear what is the correct security model for this API,
> but the current approach I've taken is to:
>
> * Make the ioctl only valid on the root (meaning that a process without
> access to the procfs root -- such as only having an fd to a procfs
> file or some open_tree(2)-like subset -- cannot use this API).
>
> * Require that the process requesting either has access to
> /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
> pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
> access to it and can join it if they had a handle), or is in a pidns
> that is a direct ancestor of the target pidns (i.e. all of the pids
> are already visible in the procfs for the current process's pidns).

What's the motivation for the ptrace-read option? While I don't see
an attack off the top of my head, it seems like creating a procfs
mount may give write-ish access to things in the pidns (because the
creator is likely to have CAP_DAC_OVERRIDE, etc) and possibly even
access to namespace-wide things that aren't inherently visible to
PID1.

Even the ancestor check seems dicey. Imagine that uid 1000 makes an
unprivileged container complete with a userns. Then uid 1001 (outside
the container) makes its own userns and mountns but stays in the init
pidns and then mounts (and owns, with all filesystem-related
capabilities) that mount. Is this really safe?

CAP_SYS_ADMIN seems about right.
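
To make the ancestor concern concrete, here is a rough userspace model of
the "current pidns is an ancestor of the target" rule. It is only a sketch:
struct pidns_model and pidns_is_ancestor_or_self() are hypothetical names
made up for illustration, not the kernel's data structures or the code in
this series. I am also assuming that, as with setns(2), the caller's own
pidns counts as allowed. Note that a caller that stays in the init pidns
passes this test for every pidns on the system, whoever created it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a pid namespace: only what the ancestry walk needs. */
struct pidns_model {
    const struct pidns_model *parent; /* NULL for the initial pidns */
    unsigned int level;               /* 0 for the initial pidns */
};

/*
 * Returns true if `active` is `target` itself or an ancestor of `target`.
 * Walk up from `target` until we reach the nesting level of `active`,
 * then compare: an ancestor must be the namespace found at that level.
 */
bool pidns_is_ancestor_or_self(const struct pidns_model *active,
                               const struct pidns_model *target)
{
    const struct pidns_model *ns = target;

    while (ns != NULL && ns->level > active->level)
        ns = ns->parent;
    return ns == active;
}

/* Self-check for the cases relevant to the scenario above. */
void pidns_model_selfcheck(void)
{
    struct pidns_model init = { NULL, 0 };
    struct pidns_model container = { &init, 1 };
    struct pidns_model nested = { &container, 2 };
    struct pidns_model sibling = { &init, 1 };

    /* The init pidns is an ancestor of everything. */
    assert(pidns_is_ancestor_or_self(&init, &container));
    assert(pidns_is_ancestor_or_self(&init, &nested));
    /* A namespace reaches itself and its descendants. */
    assert(pidns_is_ancestor_or_self(&container, &container));
    assert(pidns_is_ancestor_or_self(&container, &nested));
    /* Siblings and descendants do not reach across or upwards. */
    assert(!pidns_is_ancestor_or_self(&sibling, &nested));
    assert(!pidns_is_ancestor_or_self(&nested, &container));
}
```

In the uid 1000 / uid 1001 example, `init` stands for uid 1001's pidns and
`container` for uid 1000's, so the ancestry test alone says nothing about
who owns the target namespace.
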
--Andy