[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20250723-procfs-pidns-api-v2-3-621e7edd8e40@cyphar.com>
Date: Wed, 23 Jul 2025 09:18:53 +1000
From: Aleksa Sarai <cyphar@...har.com>
To: Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
Jonathan Corbet <corbet@....net>, Shuah Khan <shuah@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-api@...r.kernel.org, linux-doc@...r.kernel.org,
linux-kselftest@...r.kernel.org, Aleksa Sarai <cyphar@...har.com>
Subject: [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
/proc has historically had very opaque semantics about PID namespaces,
which is a little unfortunate for container runtimes and other programs
that deal with switching namespaces very often. One common issue is that
of converting between PIDs in the process's namespace and PIDs in the
namespace of /proc.
In principle, it is possible to do this today by opening a pidfd with
pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
contain a PID value translated to the pid namespace associated with that
procfs superblock). However, allocating a new file for each PID to be
converted is less than ideal for programs that may need to scan procfs,
and it is generally useful for userspace to be able to finally get this
information from procfs.
So, add a new API for this in the form of an ioctl(2) you can call on
the root directory of procfs. The returned file descriptor will have
O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
option, finally allowing userspace full control of the pid namespaces
associated with procfs instances.
The permission model for this is a bit looser than that of the "pidns"
mount option, but this is mainly because /proc/1/ns/pid provides the
same information, so as long as you have access to that magic-link (or
something equivalently reasonable such as privileges with CAP_SYS_ADMIN
or being in an ancestor pid namespace) it makes sense to allow userspace
to grab a handle. setns(2) will still have their own permission checks,
so being able to open a pidns handle doesn't really provide too many
other capabilities.
Signed-off-by: Aleksa Sarai <cyphar@...har.com>
---
Documentation/filesystems/proc.rst | 4 +++
fs/proc/root.c | 54 ++++++++++++++++++++++++++++++++++++--
include/uapi/linux/fs.h | 3 +++
3 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c520b9f8a3fd..506383273c9d 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
will be used by the procfs instance when translating pids. By default, procfs
will use the calling process's active pid namespace.
+Processes can check which pid namespace is used by a procfs instance by using
+the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
+instance.
+
Chapter 5: Filesystem behavior
==============================
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 057c8a125c6e..548a57ec2152 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,8 +23,10 @@
#include <linux/cred.h>
#include <linux/magic.h>
#include <linux/slab.h>
+#include <linux/ptrace.h>
#include "internal.h"
+#include "../internal.h"
struct proc_fs_context {
struct pid_namespace *pid_ns;
@@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
return proc_pid_readdir(file, ctx);
}
+static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+ switch (cmd) {
+#ifdef CONFIG_PID_NS
+ case PROCFS_GET_PID_NAMESPACE: {
+ struct pid_namespace *active = task_active_pid_ns(current);
+ struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
+ bool can_access_pidns = false;
+
+ /*
+ * If we are in an ancestors of the pidns, or have join
+ * privileges (CAP_SYS_ADMIN), then it makes sense that we
+ * would be able to grab a handle to the pidns.
+ *
+ * Otherwise, if there is a root process, then being able to
+ * access /proc/$pid/ns/pid is equivalent to this ioctl and so
+ * we should probably match the permission model. For empty
+ * namespaces it seems unlikely for there to be a downside to
+ * allowing unprivileged users to open a handle to it (setns
+ * will fail for unprivileged users anyway).
+ */
+ can_access_pidns = pidns_is_ancestor(ns, active) ||
+ ns_capable(ns->user_ns, CAP_SYS_ADMIN);
+ if (!can_access_pidns) {
+ bool cannot_ptrace_pid1 = false;
+
+ read_lock(&tasklist_lock);
+ if (ns->child_reaper)
+ cannot_ptrace_pid1 = ptrace_may_access(ns->child_reaper,
+ PTRACE_MODE_READ_FSCREDS);
+ read_unlock(&tasklist_lock);
+ can_access_pidns = !cannot_ptrace_pid1;
+ }
+ if (!can_access_pidns)
+ return -EPERM;
+
+ /* open_namespace() unconditionally consumes the reference. */
+ get_pid_ns(ns);
+ return open_namespace(to_ns_common(ns));
+ }
+#endif /* CONFIG_PID_NS */
+ default:
+ return -ENOIOCTLCMD;
+ }
+}
+
/*
* The root /proc directory is special, as it has the
* <pid> directories. Thus we don't use the generic
* directory handling functions for that..
*/
static const struct file_operations proc_root_operations = {
- .read = generic_read_dir,
- .iterate_shared = proc_root_readdir,
+ .read = generic_read_dir,
+ .iterate_shared = proc_root_readdir,
.llseek = generic_file_llseek,
+ .unlocked_ioctl = proc_root_ioctl,
+ .compat_ioctl = compat_ptr_ioctl,
};
/*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 0bd678a4a10e..aa642cb48feb 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -437,6 +437,9 @@ typedef int __bitwise __kernel_rwf_t;
#define PROCFS_IOCTL_MAGIC 'f'
+/* procfs root ioctls */
+#define PROCFS_GET_PID_NAMESPACE _IO(PROCFS_IOCTL_MAGIC, 1)
+
/* Pagemap ioctl */
#define PAGEMAP_SCAN _IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
--
2.50.0
Powered by blists - more mailing lists