linux-kernel - Re: [PATCH v4 2/4] procfs: add "pidns" mount option

Message-ID: <2025-08-05.1754378656-steep-harps-muscled-mailroom-lively-gosling-VVGNTP@cyphar.com>
Date: Tue, 5 Aug 2025 17:29:07 +1000
From: Aleksa Sarai <cyphar@...har.com>
To: Alexander Viro <viro@...iv.linux.org.uk>, 
	Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, Jonathan Corbet <corbet@....net>, 
	Shuah Khan <shuah@...nel.org>
Cc: Andy Lutomirski <luto@...capital.net>, linux-kernel@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-api@...r.kernel.org, linux-doc@...r.kernel.org, 
	linux-kselftest@...r.kernel.org, Amir Goldstein <amir73il@...il.com>
Subject: Re: [PATCH v4 2/4] procfs: add "pidns" mount option

On 2025-08-05, Aleksa Sarai <cyphar@...har.com> wrote:
> Since the introduction of pid namespaces, their interaction with procfs
> has been entirely implicit in ways that require a lot of dancing around
> by programs that need to construct sandboxes with different PID
> namespaces.
> 
> Being able to explicitly specify the pid namespace to use when
> constructing a procfs super block will allow programs to no longer need
> to fork off a process which then does unshare(2) / setns(2) and
> forks again in order to construct a procfs in a pidns.
> 
> So, provide a "pidns" mount option which allows such users to just
> explicitly state which pid namespace they want that procfs instance to
> use. This interface can be used with fsconfig(2) either with a file
> descriptor or a path:
> 
>   fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>   fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> 
> or with classic mount(2) / mount(8):
> 
>   // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>   mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> 
> As this new API is effectively shorthand for setns(2) followed by
> mount(2), the permission model for this mirrors pidns_install() to avoid
> opening up new attack surfaces by loosening the existing permission
> model.
> 
> In order to avoid having to RCU-protect all users of proc_pid_ns() (to
> avoid UAFs), attempting to reconfigure an existing procfs instance's pid
> namespace will error out with -EBUSY. Creating new procfs instances is
> quite cheap, so this should not be an impediment to most users, and lets
> us avoid a lot of churn in fs/proc/* for a feature that it seems
> unlikely userspace would use.
> 
> Signed-off-by: Aleksa Sarai <cyphar@...har.com>
> ---
>  Documentation/filesystems/proc.rst |  8 ++++
>  fs/proc/root.c                     | 98 +++++++++++++++++++++++++++++++++++---
>  2 files changed, 100 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 5236cb52e357..5a157dadea0b 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -2360,6 +2360,7 @@ The following mount options are supported:
>  	hidepid=	Set /proc/<pid>/ access mode.
>  	gid=		Set the group authorized to learn processes information.
>  	subset=		Show only the specified subset of procfs.
> +	pidns=		Specify the namespace used by this procfs.
>  	=========	========================================================
>  
>  hidepid=off or hidepid=0 means classic mode - everybody may access all
> @@ -2392,6 +2393,13 @@ information about processes information, just add identd to this group.
>  subset=pid hides all top level files and directories in the procfs that
>  are not related to tasks.
>  
> +pidns= specifies a pid namespace (either as a string path to something like
> +`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
> +will be used by the procfs instance when translating pids. By default, procfs
> +will use the calling process's active pid namespace. Note that the pid
> +namespace of an existing procfs instance cannot be modified (attempting to do
> +so will give an `-EBUSY` error).
> +
>  Chapter 5: Filesystem behavior
>  ==============================
>  
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index ed86ac710384..fd1f1c8a939a 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -38,12 +38,14 @@ enum proc_param {
>  	Opt_gid,
>  	Opt_hidepid,
>  	Opt_subset,
> +	Opt_pidns,
>  };
>  
>  static const struct fs_parameter_spec proc_fs_parameters[] = {
> -	fsparam_u32("gid",	Opt_gid),
> +	fsparam_u32("gid",		Opt_gid),
>  	fsparam_string("hidepid",	Opt_hidepid),
>  	fsparam_string("subset",	Opt_subset),
> +	fsparam_file_or_string("pidns",	Opt_pidns),
>  	{}
>  };
>  
> @@ -109,11 +111,66 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PID_NS
> +static int proc_parse_pidns_param(struct fs_context *fc,
> +				  struct fs_parameter *param,
> +				  struct fs_parse_result *result)
> +{
> +	struct proc_fs_context *ctx = fc->fs_private;
> +	struct pid_namespace *target, *active = task_active_pid_ns(current);
> +	struct ns_common *ns;
> +	struct file *ns_filp __free(fput) = NULL;
> +
> +	switch (param->type) {
> +	case fs_value_is_file:
> +		/* came through fsconfig, steal the file reference */
> +		ns_filp = no_free_ptr(param->file);
> +		break;
> +	case fs_value_is_string:
> +		ns_filp = filp_open(param->string, O_RDONLY, 0);
> +		break;

I just realised that we probably also want to support FSCONFIG_SET_PATH
here, but fsparam_file_or_string() doesn't handle that at the moment. I
think we probably want to have fsparam_file_or_path() which would act
like:

 1. A path with FSCONFIG_SET_STRING and FSCONFIG_SET_PATH.
 2. A file with FSCONFIG_SET_FD.

These are the semantics I would already expect from these kinds of
flags, but at the moment FSCONFIG_SET_PATH is entirely disallowed.
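
To make the difference concrete, here is a rough userspace model of the two
behaviours. It is only a sketch, not kernel code: fsparam_file_or_path() does
not exist yet, and the names `enum cmd`, `enum value_kind`,
`file_or_string_kind()` and `file_or_path_kind()` are hypothetical stand-ins
invented for this illustration.

```c
/* Hypothetical stand-ins for the fsconfig(2) commands relevant here. */
enum cmd { CMD_SET_STRING, CMD_SET_PATH, CMD_SET_FD, CMD_OTHER };

/* How the parameter value ends up being interpreted. */
enum value_kind { VALUE_REJECTED, VALUE_PATH, VALUE_FILE };

/*
 * Model of fsparam_file_or_string() as described above: a string (which
 * the filesystem then opens as a path) or a file descriptor is accepted,
 * but FSCONFIG_SET_PATH is rejected.
 */
enum value_kind file_or_string_kind(enum cmd cmd)
{
	switch (cmd) {
	case CMD_SET_STRING:
		return VALUE_PATH;
	case CMD_SET_FD:
		return VALUE_FILE;
	default:
		return VALUE_REJECTED;
	}
}

/*
 * Model of the proposed fsparam_file_or_path() (an assumption, nothing
 * like this is merged): FSCONFIG_SET_PATH is treated the same way as
 * FSCONFIG_SET_STRING.
 */
enum value_kind file_or_path_kind(enum cmd cmd)
{
	switch (cmd) {
	case CMD_SET_STRING:
	case CMD_SET_PATH:
		return VALUE_PATH;
	case CMD_SET_FD:
		return VALUE_FILE;
	default:
		return VALUE_REJECTED;
	}
}
```

The only input on which the two functions disagree is CMD_SET_PATH, which is
the whole point of the proposed change.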

@Amir:

I wonder if overlayfs (the only other user of fsparam_file_or_string())
would also prefer having these semantics? We could just migrate
fsparam_file_or_string() to fsparam_file_or_path() everywhere, since I'm
pretty sure these are the semantics userspace expects anyway.

> +	default:
> +		WARN_ON_ONCE(true);
> +		break;
> +	}
> +	if (!ns_filp)
> +		ns_filp = ERR_PTR(-EBADF);
> +	if (IS_ERR(ns_filp)) {
> +		errorfc(fc, "could not get file from pidns argument");
> +		return PTR_ERR(ns_filp);
> +	}
> +
> +	if (!proc_ns_file(ns_filp))
> +		return invalfc(fc, "pidns argument is not an nsfs file");
> +	ns = get_proc_ns(file_inode(ns_filp));
> +	if (ns->ops->type != CLONE_NEWPID)
> +		return invalfc(fc, "pidns argument is not a pidns file");
> +	target = container_of(ns, struct pid_namespace, ns);
> +
> +	/*
> +	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
> +	 * permission model should be the same as pidns_install().
> +	 */
> +	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
> +		errorfc(fc, "insufficient permissions to set pidns");
> +		return -EPERM;
> +	}
> +	if (!pidns_is_ancestor(target, active))
> +		return invalfc(fc, "cannot set pidns to non-descendant pidns");
> +
> +	put_pid_ns(ctx->pid_ns);
> +	ctx->pid_ns = get_pid_ns(target);
> +	put_user_ns(fc->user_ns);
> +	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> +	return 0;
> +}
> +#endif /* CONFIG_PID_NS */
> +
>  static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  {
>  	struct proc_fs_context *ctx = fc->fs_private;
>  	struct fs_parse_result result;
> -	int opt;
> +	int opt, err;
>  
>  	opt = fs_parse(fc, proc_fs_parameters, param, &result);
>  	if (opt < 0)
> @@ -125,14 +182,38 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  		break;
>  
>  	case Opt_hidepid:
> -		if (proc_parse_hidepid_param(fc, param))
> -			return -EINVAL;
> +		err = proc_parse_hidepid_param(fc, param);
> +		if (err)
> +			return err;
>  		break;
>  
>  	case Opt_subset:
> -		if (proc_parse_subset_param(fc, param->string) < 0)
> -			return -EINVAL;
> +		err = proc_parse_subset_param(fc, param->string);
> +		if (err)
> +			return err;
> +		break;
> +
> +	case Opt_pidns:
> +#ifdef CONFIG_PID_NS
> +		/*
> +		 * We would have to RCU-protect every proc_pid_ns() or
> +		 * proc_sb_info() access if we allowed this to be reconfigured
> +		 * for an existing procfs instance. Luckily, procfs instances
> +		 * are cheap to create, and mount-beneath would let you
> +		 * atomically replace an instance even with overmounts.
> +		 */
> +		if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) {
> +			errorfc(fc, "cannot reconfigure pidns for existing procfs");
> +			return -EBUSY;
> +		}
> +		err = proc_parse_pidns_param(fc, param, &result);
> +		if (err)
> +			return err;
>  		break;
> +#else
> +		errorfc(fc, "pidns mount flag not supported on this system");
> +		return -EOPNOTSUPP;
> +#endif
>  
>  	default:
>  		return -EINVAL;
> @@ -154,6 +235,11 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
>  		fs_info->hide_pid = ctx->hidepid;
>  	if (ctx->mask & (1 << Opt_subset))
>  		fs_info->pidonly = ctx->pidonly;
> +	if (ctx->mask & (1 << Opt_pidns) &&
> +	    !WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
> +		put_pid_ns(fs_info->pid_ns);
> +		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
> +	}
>  }
>  
>  static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> 
> -- 
> 2.50.1
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
