Message-ID: <2025-07-25.1753409614-vile-track-icky-epidemic-frail-antidote-d7NYuu@cyphar.com>
Date: Fri, 25 Jul 2025 12:24:28 +1000
From: Aleksa Sarai <cyphar@...har.com>
To: Christian Brauner <brauner@...nel.org>
Cc: Alexander Viro <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>,
Jonathan Corbet <corbet@....net>, Shuah Khan <shuah@...nel.org>, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-api@...r.kernel.org, linux-doc@...r.kernel.org,
linux-kselftest@...r.kernel.org
Subject: Re: [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
On 2025-07-24, Christian Brauner <brauner@...nel.org> wrote:
> On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> > /proc has historically had very opaque semantics about PID namespaces,
> > which is a little unfortunate for container runtimes and other programs
> > that deal with switching namespaces very often. One common issue is that
> > of converting between PIDs in the process's namespace and PIDs in the
> > namespace of /proc.
> >
> > In principle, it is possible to do this today by opening a pidfd with
> > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> > contain a PID value translated to the pid namespace associated with that
> > procfs superblock). However, allocating a new file for each PID to be
> > converted is less than ideal for programs that may need to scan procfs,
> > and it is generally useful for userspace to be able to finally get this
> > information from procfs.
> >
> > So, add a new API for this in the form of an ioctl(2) you can call on
> > the root directory of procfs. The returned file descriptor will have
> > O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
> > option, finally allowing userspace full control of the pid namespaces
> > associated with procfs instances.
> >
> > The permission model for this is a bit looser than that of the "pidns"
> > mount option, but this is mainly because /proc/1/ns/pid provides the
> > same information, so as long as you have access to that magic-link (or
> > something equivalently reasonable such as privileges with CAP_SYS_ADMIN
> > or being in an ancestor pid namespace) it makes sense to allow userspace
> > to grab a handle. setns(2) will still have its own permission checks,
> > so being able to open a pidns handle doesn't really provide too many
> > other capabilities.
> >
> > Signed-off-by: Aleksa Sarai <cyphar@...har.com>
> > ---
> > Documentation/filesystems/proc.rst | 4 +++
> > fs/proc/root.c | 54 ++++++++++++++++++++++++++++++++++++--
> > include/uapi/linux/fs.h | 3 +++
> > 3 files changed, 59 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index c520b9f8a3fd..506383273c9d 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
> > will be used by the procfs instance when translating pids. By default, procfs
> > will use the calling process's active pid namespace.
> >
> > +Processes can check which pid namespace is used by a procfs instance by using
> > +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
> > +instance.
> > +
> > Chapter 5: Filesystem behavior
> > ==============================
> >
> > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > index 057c8a125c6e..548a57ec2152 100644
> > --- a/fs/proc/root.c
> > +++ b/fs/proc/root.c
> > @@ -23,8 +23,10 @@
> > #include <linux/cred.h>
> > #include <linux/magic.h>
> > #include <linux/slab.h>
> > +#include <linux/ptrace.h>
> >
> > #include "internal.h"
> > +#include "../internal.h"
> >
> > struct proc_fs_context {
> > struct pid_namespace *pid_ns;
> > @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> > return proc_pid_readdir(file, ctx);
> > }
> >
> > +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > +{
> > + switch (cmd) {
> > +#ifdef CONFIG_PID_NS
> > + case PROCFS_GET_PID_NAMESPACE: {
> > + struct pid_namespace *active = task_active_pid_ns(current);
> > + struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
> > + bool can_access_pidns = false;
> > +
> > + /*
> > + * If we are in an ancestor of the pidns, or have join
> > + * privileges (CAP_SYS_ADMIN), then it makes sense that we
> > + * would be able to grab a handle to the pidns.
> > + *
> > + * Otherwise, if there is a root process, then being able to
> > + * access /proc/$pid/ns/pid is equivalent to this ioctl and so
> > + * we should probably match the permission model. For empty
> > + * namespaces it seems unlikely for there to be a downside to
> > + * allowing unprivileged users to open a handle to it (setns
> > + * will fail for unprivileged users anyway).
> > + */
> > + can_access_pidns = pidns_is_ancestor(ns, active) ||
> > + ns_capable(ns->user_ns, CAP_SYS_ADMIN);
>
> This seems to imply that if @ns is a descendant of @active that the
> caller holds privileges over it. Is that actually always true?
>
> IOW, why is the check different from the previous pidns= mount option
> check. I would've expected:
>
> ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
>
> and then the ptrace check as a fallback.

That would mirror pidns_install(), and I did think about it. The primary
(mostly handwave-y) reasoning I had for making it less strict was that:

 * If you are in an ancestor pidns, then you can already see those
   processes in your own /proc. In theory that means that you will be
   able to access /proc/$pid/ns/pid for at least some subprocess there
   (even if some subprocesses have SUID_DUMP_DISABLE, that flag is
   cleared on ).

   Though hypothetically if they are all running as a different user,
   this does not apply (and you could create scenarios where a child
   pidns is owned by a userns that you do not have privileges over -- if
   you deal with setuid binaries). Maybe that risk means we should just
   combine them; I'm not sure.

 * If you have CAP_SYS_ADMIN permissions over the pidns, it seems
   strange to disallow access even if it is not in an ancestor
   namespace. This is distinct from pidns_install(), where you want to
   ensure you cannot escape to a parent pid namespace. This ioctl is
   only about getting a handle to do other operations
   (i.e. NS_GET_{P,TG}ID_*_PIDNS).

Maybe they should be combined to match pidns_install(), but then I would
expect the ptrace_may_access() check to apply to all processes in the
pidns to make it less restrictive, which is not something you can
practically do (and there is a higher chance that pid1 will have
SUID_DUMP_DISABLE than some random subprocess, which almost certainly
will not be SUID_DUMP_DISABLE).

Fundamentally, I guess I'm still trying to see what the risk is of
allowing a process to get a handle to a pidns that it has some kind of
privilege over (whether it's CAP_SYS_ADMIN, or by virtue of being able
to see and address all processes in the namespace, or by being able to
open /proc/$pidns_pid1/ns/pid anyway) but cannot join.

Then again, maybe the fact that it is kind of strange to explain is
enough of a reason to just make it simpler...
> > + if (!can_access_pidns) {
> > + bool cannot_ptrace_pid1 = false;
> > +
> > + read_lock(&tasklist_lock);
> > + if (ns->child_reaper)
> > + cannot_ptrace_pid1 = !ptrace_may_access(ns->child_reaper,
> > + PTRACE_MODE_READ_FSCREDS);
> > + read_unlock(&tasklist_lock);
> > + can_access_pidns = !cannot_ptrace_pid1;
> > + }
> > + if (!can_access_pidns)
> > + return -EPERM;
> > +
> > + /* open_namespace() unconditionally consumes the reference. */
> > + get_pid_ns(ns);
> > + return open_namespace(to_ns_common(ns));
> > + }
> > +#endif /* CONFIG_PID_NS */
> > + default:
> > + return -ENOIOCTLCMD;
> > + }
> > +}
> > +
> > /*
> > * The root /proc directory is special, as it has the
> > * <pid> directories. Thus we don't use the generic
> > * directory handling functions for that..
> > */
> > static const struct file_operations proc_root_operations = {
> > - .read = generic_read_dir,
> > - .iterate_shared = proc_root_readdir,
> > + .read = generic_read_dir,
> > + .iterate_shared = proc_root_readdir,
> > .llseek = generic_file_llseek,
> > + .unlocked_ioctl = proc_root_ioctl,
> > + .compat_ioctl = compat_ptr_ioctl,
> > };
> >
> > /*
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 0bd678a4a10e..aa642cb48feb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -437,6 +437,9 @@ typedef int __bitwise __kernel_rwf_t;
> >
> > #define PROCFS_IOCTL_MAGIC 'f'
> >
> > +/* procfs root ioctls */
> > +#define PROCFS_GET_PID_NAMESPACE _IO(PROCFS_IOCTL_MAGIC, 1)
> > +
> > /* Pagemap ioctl */
> > #define PAGEMAP_SCAN _IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
> >
> >
> > --
> > 2.50.0
> >
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/