lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7a92bca2-8789-c67d-c6bd-150506d9a476@yandex-team.ru>
Date:   Thu, 8 Mar 2018 22:51:08 +0300
From:   Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To:     nagarathnam.muthusamy@...cle.com, linux-kernel@...r.kernel.org,
        linux-api@...r.kernel.org
Cc:     Nagarajan.Muthukrishnan@...cle.com, prakash.sangappa@...cle.com
Subject: Re: [RFC] translate_pid API

That's something over-engineered.

Instead of keeping all file descriptor you could remember
pid of init task and pid-ns inode number for validation.
And open pid-ns fd when needed.

On 06.03.2018 20:38, nagarathnam.muthusamy@...cle.com wrote:
> Following patch which is a variation of a solution discussed
> in https://lwn.net/Articles/736330/ provides the users of
> pid namespace, the functionality of pid translation between
> namespaces using a namespace identifier. The topic of
> pid translation has been discussed in the community few times
> but there has always been a resistance to adding new solution
> for this problem.
> 
> I will outline the planned usecase of pid namespace by oracle
> database and explain why any of the existing solution cannot
> be used to solve their problem.
> 
> Consider a system in which several PID namespaces with multiple
> nested levels exists in parallel with monitor processes managing
> all the namespaces. PID translation is required for controlling
> and accessing information about the processes by the monitors
> and other processes down the hierarchy of namespaces. Controlling
> primarily involves sending signals or using ptrace by a process in
> parent namespace on any of the processes in its child namespace.
> Accessing information deals with the reading /proc/<pid>/* files
> of processes in child namespace. None of the processes have
> root/CAP_SYS_ADMIN privileges.
> 
> Existing methods:
> 
> *) SCM_CREDENTIALS message:
> 	Ref: http://man7.org/linux/man-pages/man7/unix.7.html
> 	The sender can translate its pid from its own namespace
> to a pid in the target namespace by sending and receiving the
> SCM_CREDENTIALS message. The drawback of this method is the
> requirement of a socket communication channel to pid translation
> which adds to the management overhead. This method does not enable
> the sender to translate the pid of other process unless it is root
> or it has CAP_SYS_ADMIN.
> 
> *) /proc/<pid>/status file
> 	Ref: http://man7.org/linux/man-pages/man5/proc.5.html
> 	Ref: https://patchwork.kernel.org/patch/5861791/
> 	/proc/<pid>/status file provides a way to find the pids
> associated with a process in different namespaces. Pid translation
> from child namespace to parent namespace from parent namespace would
> require searching all the status file in the parent namespace to find
> the desired pid at desired level which is inefficient.
> 
> Method under review:
> 
> *) pidns: introduce syscall translate_pid
> 	Ref: https://lwn.net/Articles/736330/
> 	This solution though not integrated yet, when used would require
> monitor to have open file descriptor for every level of namespace it
> has to monitor which results in explosion of number of file descriptors
> the monitor has to keep open.
> 
> Following patch uses a 64-bit ID for namespace exported by procfs
> for pid translation through a new file /proc/<pid>/ns/pidns_id.
> 
> [root@...-x4600-01 ~]# ls -l /proc/4690/ns
> total 0
> lrwxrwxrwx 1 root root 0 Mar  1 15:55 cgroup -> cgroup:[4026531835]
> lrwxrwxrwx 1 root root 0 Mar  1 15:55 ipc -> ipc:[4026531839]
> lrwxrwxrwx 1 root root 0 Mar  1 15:55 mnt -> mnt:[4026532324]
> lrwxrwxrwx 1 root root 0 Mar  1 15:55 net -> net:[4026531993]
> lrwxrwxrwx 1 root root 0 Mar  1 15:55 pid -> pid:[4026532325]
> lrwxrwxrwx 1 root root 0 Mar  1 15:55 pid_for_children -> pid:[4026532325]
> lrwxrwxrwx 1 root root 0 Mar  1 15:55 pidns_id -> [14994344851325148137]
> lrwxrwxrwx 1 root root 0 Mar  1 15:55 user -> user:[4026531837]
> lrwxrwxrwx 1 root root 0 Mar  1 15:55 uts -> uts:[4026531838]
> 
> Though there is a problem of ID being recycled in longterm, managing
> an ID per namespace is easier than having open file descriptors per
> pid namespace. Existing namespace inode numbers recycles faster and hence
> their usability as ID for this API is less.
> 
> Signed-off-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@...cle.com>
> ---
>   arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>   arch/x86/entry/syscalls/syscall_64.tbl |   2 +
>   fs/nsfs.c                              |   9 +-
>   fs/proc/namespaces.c                   |   1 +
>   include/linux/ns_common.h              |   1 +
>   include/linux/pid_namespace.h          |   3 +
>   include/linux/proc_ns.h                |   1 +
>   include/linux/syscalls.h               |   1 +
>   kernel/pid_namespace.c                 | 190 ++++++++++++++++++++++++++++++++-
>   kernel/sys_ni.c                        |   4 +
>   10 files changed, 208 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 448ac21..31bf798 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -391,3 +391,4 @@
>   382	i386	pkey_free		sys_pkey_free
>   383	i386	statx			sys_statx
>   384	i386	arch_prctl		sys_arch_prctl			compat_sys_arch_prctl
> +385	i386	translate_pid		sys_translate_pid		compat_sys_translate_pid
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183..89196c3 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
>   330	common	pkey_alloc		sys_pkey_alloc
>   331	common	pkey_free		sys_pkey_free
>   332	common	statx			sys_statx
> +333	64	translate_pid		sys_translate_pid
>   
>   #
>   # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -380,3 +381,4 @@
>   545	x32	execveat		compat_sys_execveat/ptregs
>   546	x32	preadv2			compat_sys_preadv64v2
>   547	x32	pwritev2		compat_sys_pwritev64v2
> +548	x32	translate_pid		compat_sys_translate_pid
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index 36b0772..c635465 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -222,8 +222,13 @@ int ns_get_name(char *buf, size_t size, struct task_struct *task,
>   	const char *name;
>   	ns = ns_ops->get(task);
>   	if (ns) {
> -		name = ns_ops->real_ns_name ? : ns_ops->name;
> -		res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
> +		if (!strcmp(ns_ops->name, "pidns_id")) {
> +			res = snprintf(buf, size, "[%llu]",
> +				       (unsigned long long)ns->ns_id);
> +		} else {
> +			name = ns_ops->real_ns_name ? : ns_ops->name;
> +			res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
> +		}
>   		ns_ops->put(ns);
>   	}
>   	return res;
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 59b17e5..ac823ce 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -24,6 +24,7 @@
>   #endif
>   #ifdef CONFIG_PID_NS
>   	&pidns_operations,
> +	&pidns_id_operations,
>   	&pidns_for_children_operations,
>   #endif
>   #ifdef CONFIG_USER_NS
> diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
> index 5fbc400..6ca3d43 100644
> --- a/include/linux/ns_common.h
> +++ b/include/linux/ns_common.h
> @@ -8,6 +8,7 @@ struct ns_common {
>   	atomic_long_t stashed;
>   	const struct proc_ns_operations *ops;
>   	unsigned int inum;
> +	u64 ns_id;
>   };
>   
>   #endif
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 49538b1..11d1d57 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -11,6 +11,7 @@
>   #include <linux/kref.h>
>   #include <linux/ns_common.h>
>   #include <linux/idr.h>
> +#include <linux/list_bl.h>
>   
>   
>   struct fs_pin;
> @@ -44,6 +45,8 @@ struct pid_namespace {
>   	kgid_t pid_gid;
>   	int hide_pid;
>   	int reboot;	/* group exit code if this pidns was rebooted */
> +	struct hlist_bl_node node;
> +	atomic_t lookups_pending;
>   	struct ns_common ns;
>   } __randomize_layout;
>   
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index d31cb62..861e38bd 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -28,6 +28,7 @@ struct proc_ns_operations {
>   extern const struct proc_ns_operations utsns_operations;
>   extern const struct proc_ns_operations ipcns_operations;
>   extern const struct proc_ns_operations pidns_operations;
> +extern const struct proc_ns_operations pidns_id_operations;
>   extern const struct proc_ns_operations pidns_for_children_operations;
>   extern const struct proc_ns_operations userns_operations;
>   extern const struct proc_ns_operations mntns_operations;
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a78186d..574349a 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -901,6 +901,7 @@ asmlinkage long sys_open_by_handle_at(int mountdirfd,
>   				      struct file_handle __user *handle,
>   				      int flags);
>   asmlinkage long sys_setns(int fd, int nstype);
> +asmlinkage long sys_translate_pid(pid_t pid, u64 source, u64 target);
>   asmlinkage long sys_process_vm_readv(pid_t pid,
>   				     const struct iovec __user *lvec,
>   				     unsigned long liovcnt,
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0b53eef..ff83aa8 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -22,6 +22,12 @@
>   #include <linux/sched/task.h>
>   #include <linux/sched/signal.h>
>   #include <linux/idr.h>
> +#include <linux/random.h>
> +#include <linux/compat.h>
> +
> +#define	PID_NS_ID_HASH_BITS     9
> +
> +struct	hlist_bl_head *pid_ns_hash;
>   
>   struct pid_cache {
>   	int nr_ids;
> @@ -34,6 +40,13 @@ struct pid_cache {
>   static DEFINE_MUTEX(pid_caches_mutex);
>   static struct kmem_cache *pid_ns_cachep;
>   
> +static inline struct hlist_bl_head *
> +	pid_ns_hash_head(struct hlist_bl_head *hash,
> +			 uint64_t key)
> +{
> +	return &hash[hash_64(key, PID_NS_ID_HASH_BITS)];
> +}
> +
>   /*
>    * creates the kmem cache to allocate pids from.
>    * @nr_ids: the number of numerical ids this pid will have to carry
> @@ -93,12 +106,24 @@ static void dec_pid_namespaces(struct ucounts *ucounts)
>   	dec_ucount(ucounts, UCOUNT_PID_NAMESPACES);
>   }
>   
> +static inline u64 get_namespace_id(void)
> +{
> +	u64 id = 0;
> +
> +	while (!id)
> +		get_random_bytes(&id, sizeof(id));
> +
> +	return id;
> +}
> +
>   static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
>   	struct pid_namespace *parent_pid_ns)
>   {
> -	struct pid_namespace *ns;
> -	unsigned int level = parent_pid_ns->level + 1;
> +	struct pid_namespace *ns, *pt;
>   	struct ucounts *ucounts;
> +	struct hlist_bl_head *head;
> +	struct hlist_bl_node *dup_node;
> +	unsigned int level = parent_pid_ns->level + 1;
>   	int err;
>   
>   	err = -EINVAL;
> @@ -135,7 +160,24 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>   	ns->ucounts = ucounts;
>   	ns->pid_allocated = PIDNS_ADDING;
>   	INIT_WORK(&ns->proc_work, proc_cleanup_work);
> -
> +	ns->ns.ns_id = get_namespace_id();
> +	while (1) {
> +		head = pid_ns_hash_head(pid_ns_hash, ns->ns.ns_id);
> +		hlist_bl_lock(head);
> +		hlist_bl_for_each_entry(pt, dup_node, head, node) {
> +			if (ns->ns.ns_id == pt->ns.ns_id) {
> +				/*
> +				 * ID is taken. Move to next ID;
> +				 */
> +				ns->ns.ns_id = get_namespace_id();
> +				hlist_bl_unlock(head);
> +				continue;
> +			}
> +		}
> +		break;
> +	}
> +	hlist_bl_add_head(&ns->node, head);
> +	hlist_bl_unlock(head);
>   	return ns;
>   
>   out_free_idr:
> @@ -159,6 +201,30 @@ static void delayed_free_pidns(struct rcu_head *p)
>   
>   static void destroy_pid_namespace(struct pid_namespace *ns)
>   {
> +	struct pid_namespace *ph;
> +	struct hlist_bl_head *head;
> +	struct hlist_bl_node *dup_node;
> +
> +	/*
> +	 * Remove the namespace structure from hash table so
> +	 * now new lookups can start on it.
> +	 */
> +	if (ns->ns.ns_id) {
> +		head = pid_ns_hash_head(pid_ns_hash, ns->ns.ns_id);
> +		hlist_bl_lock(head);
> +		hlist_bl_for_each_entry(ph, dup_node, head, node) {
> +			if (ns->ns.ns_id == ph->ns.ns_id) {
> +				hlist_bl_del_init(&ph->node);
> +				break;
> +			}
> +		}
> +		hlist_bl_unlock(head);
> +	}
> +	/*
> +	 * Wait for pending lookups to complete.
> +	 */
> +	while (atomic_read(&ns->lookups_pending))
> +		cpu_relax();
>   	ns_free_inum(&ns->ns);
>   
>   	idr_destroy(&ns->idr);
> @@ -463,6 +529,17 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
>   	.get_parent	= pidns_get_parent,
>   };
>   
> +const struct proc_ns_operations pidns_id_operations = {
> +	.name		= "pidns_id",
> +	.real_ns_name	= "pid",
> +	.type		= CLONE_NEWPID,
> +	.get		= pidns_get,
> +	.put		= pidns_put,
> +	.install	= pidns_install,
> +	.owner		= pidns_owner,
> +	.get_parent	= pidns_get_parent,
> +};
> +
>   const struct proc_ns_operations pidns_for_children_operations = {
>   	.name		= "pid_for_children",
>   	.real_ns_name	= "pid",
> @@ -474,9 +551,116 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
>   	.get_parent	= pidns_get_parent,
>   };
>   
> +/*
> + * translate_pid - convert pid in source pid-ns into target pid-ns.
> + * @pid: pid for translation
> + * @source: pid-ns id
> + * @target: pid-ns id
> + *
> + * Return pid in @target pid-ns, zero if task have no pid there,
> + * or -ESRCH of task with @pid is not found in @source pid-ns.
> + */
> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
> +		u64, target)
> +{
> +	struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> +	struct pid *struct_pid;
> +	struct pid_namespace *ph;
> +	struct hlist_bl_head *shead = NULL;
> +	struct hlist_bl_head *thead = NULL;
> +	struct hlist_bl_node *dup_node;
> +	pid_t result;
> +
> +	if (!source) {
> +		source_ns = &init_pid_ns;
> +	} else {
> +		shead = pid_ns_hash_head(pid_ns_hash, source);
> +		hlist_bl_lock(shead);
> +		hlist_bl_for_each_entry(ph, dup_node, shead, node) {
> +			if (source == ph->ns.ns_id) {
> +				source_ns = ph;
> +				break;
> +			}
> +		}
> +		if (!source_ns) {
> +			hlist_bl_unlock(shead);
> +			return -EINVAL;
> +		}
> +	}
> +	if (!ptrace_may_access(source_ns->child_reaper,
> +			       PTRACE_MODE_READ_FSCREDS)) {
h> +		if (shead)
> +			hlist_bl_unlock(shead);
> +		return -EPERM;
> +	}
> +
> +	atomic_inc(&source_ns->lookups_pending);
> +	if (shead)
> +		hlist_bl_unlock(shead);
> +
> +	if (!target) {
> +		target_ns = &init_pid_ns;
> +	} else {
> +		thead = pid_ns_hash_head(pid_ns_hash, target);
> +		hlist_bl_lock(thead);
> +		hlist_bl_for_each_entry(ph, dup_node, thead, node) {
> +			if (target == ph->ns.ns_id) {
> +				target_ns = ph;
> +				break;
> +			}
> +		}
> +		if (!target_ns) {
> +			atomic_dec(&source_ns->lookups_pending);
> +			hlist_bl_unlock(thead);
> +			return -EINVAL;
> +		}
> +	}
> +	if (!ptrace_may_access(target_ns->child_reaper,
> +			       PTRACE_MODE_READ_FSCREDS)) {
> +		atomic_dec(&source_ns->lookups_pending);
> +		if (thead)
> +			hlist_bl_unlock(thead);
> +		return -EPERM;
> +	}
> +	atomic_inc(&target_ns->lookups_pending);
> +	if (thead)
> +		hlist_bl_unlock(thead);
> +
> +	rcu_read_lock();
> +	struct_pid = find_pid_ns(pid, source_ns);
> +	result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
> +	rcu_read_unlock();
> +	atomic_dec(&source_ns->lookups_pending);
> +	atomic_dec(&target_ns->lookups_pending);
> +	return result;
> +}
> +
> +#ifdef	CONFIG_COMPAT
> +COMPAT_SYSCALL_DEFINE5(translate_pid, pid_t, pid, u32, s0, u32, s1,
> +		       u32, t0, u32, t1)
> +{
> +#ifdef __BIG_ENDIAN
> +	return sys_translate_pid(pid, ((u64)s0 << 32) | s1,
> +				 ((u64)t0 << 32) | t1);
> +#else
> +	return sys_translate_pid(pid, ((u64)s1 << 32) | s0,
> +				 ((u64)t1 << 32) | t0);
> +#endif
> +}
> +#endif
> +
>   static __init int pid_namespaces_init(void)
>   {
> +	unsigned long bucket_count;
> +	int i;
> +
> +	bucket_count = (1UL << PID_NS_ID_HASH_BITS);
>   	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> +	pid_ns_hash = kmalloc_array(bucket_count, sizeof(struct hlist_bl_head),
> +				    GFP_KERNEL);
> +
> +	for (i = 0; i < bucket_count; i++)
> +		INIT_HLIST_BL_HEAD(&pid_ns_hash[i]);
>   
>   #ifdef CONFIG_CHECKPOINT_RESTORE
>   	register_sysctl_paths(kern_path, pid_ns_ctl_table);
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index b518976..467255f 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -259,3 +259,7 @@ asmlinkage long sys_ni_syscall(void)
>   cond_syscall(sys_pkey_mprotect);
>   cond_syscall(sys_pkey_alloc);
>   cond_syscall(sys_pkey_free);
> +
> +/* pid translation */
> +cond_syscall(sys_translate_pid);
> +cond_syscall(compat_sys_translate_pid);
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ