[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20120120150108.GA9081@redhat.com>
Date: Fri, 20 Jan 2012 16:01:08 +0100
From: Oleg Nesterov <oleg@...hat.com>
To: Cyrill Gorcunov <gorcunov@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
KOSAKI Motohiro <kosaki.motohiro@...il.com>,
LKML <linux-kernel@...r.kernel.org>,
Pavel Emelyanov <xemul@...allels.com>,
Serge Hallyn <serge.hallyn@...onical.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Tejun Heo <tj@...nel.org>, Andrew Vagin <avagin@...nvz.org>,
Vasiliy Kulikov <segoon@...nwall.com>
Subject: Re: [PATCH] fs, proc: Introduce /proc/<pid>/task/<tid>/children
entry v7
On 01/19, Cyrill Gorcunov wrote:
>
> Here is refreshed one for sure.
Reviewed-by: Oleg Nesterov <oleg@...hat.com>
> From: Cyrill Gorcunov <gorcunov@...nvz.org>
> Subject: fs, proc: Introduce /proc/<pid>/task/<tid>/children entry v8
>
> When we do checkpoint of a task we need to know the list of children
> the task, has but there is no easy and fast way to generate reverse
> parent->children chain from arbitrary <pid> (while a parent pid is
> provided in "PPid" field of /proc/<pid>/status).
>
> So instead of walking over all pids in the system (creating one big process
> tree in memory, just to figure out which children a task has) -- we add
> explicit /proc/<pid>/task/<tid>/children entry, because the kernel already has
> this kind of information but it is not yet exported.
>
> This is a first level children, not the whole process tree.
>
> v2:
> - Kame suggested to use a separated /proc/<pid>/children entry
> instead of poking /proc/<pid>/status
> - Andew suggested to use rcu facility instead of locking
> tasklist_lock
> - Tejun pointed that non-seekable seq file might not be
> enough for tasks with large number of children
>
> v3:
> - To be on a safe side use %lu format for pid_t printing
>
> v4:
> - New line get printed when sequence ends not at seq->stop,
> a nit pointed by Tejun
> - Documentation update
> - tasklist_lock is back, Oleg pointed that ->children list
> is actually not rcu-safe
>
> v5:
> - Oleg suggested to make /proc/<pid>/task/<tid>/children
> instead of global /proc/<pid>/children, which eliminates
> hardness related to threads and children migration, and
> allows patch to be a way simplier.
>
> v6:
> - Drop ptrace_may_access tests, pids are can be found anyway
> so nothing to protect here.
> - Update comments and docs, pointed by Oleg.
>
> v7:
> - Use get_pid over proc-pid directly, to simplify
> code, pointed by Oleg.
>
> v8:
> - Obtain a starting pid from the proc's inode directly.
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@...nvz.org>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Pavel Emelyanov <xemul@...allels.com>
> Cc: Serge Hallyn <serge.hallyn@...onical.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
> Cc: Oleg Nesterov <oleg@...hat.com>
> ---
> Documentation/filesystems/proc.txt | 18 +++++
> fs/proc/array.c | 123 +++++++++++++++++++++++++++++++++++++
> fs/proc/base.c | 1
> fs/proc/internal.h | 1
> 4 files changed, 143 insertions(+)
>
> Index: linux-2.6.git/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux-2.6.git.orig/Documentation/filesystems/proc.txt
> +++ linux-2.6.git/Documentation/filesystems/proc.txt
> @@ -40,6 +40,7 @@ Table of Contents
> 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
> 3.5 /proc/<pid>/mountinfo - Information about mounts
> 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
> + 3.7 /proc/<pid>/task/<tid>/children - Information about task children
>
> 4 Configuring procfs
> 4.1 Mount options
> @@ -1549,6 +1550,23 @@ then the kernel's TASK_COMM_LEN (current
> comm value.
>
>
> +3.7 /proc/<pid>/task/<tid>/children - Information about task children
> +-------------------------------------------------------------------------
> +This file provides a fast way to retrieve first level children pids
> +of a task pointed by <pid>/<tid> pair. The format is a space separated
> +stream of pids.
> +
> +Note the "first level" here -- if a child has own children they will
> +not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
> +to obtain the descendants.
> +
> +Since this interface is intended to be fast and cheap it doesn't
> +guarantee to provide precise results and some children might be
> +skipped, especially if they've exited right after we printed their
> +pids, so one need to either stop or freeze processes being inspected
> +if precise results are needed.
> +
> +
> ------------------------------------------------------------------------------
> Configuring procfs
> ------------------------------------------------------------------------------
> Index: linux-2.6.git/fs/proc/array.c
> ===================================================================
> --- linux-2.6.git.orig/fs/proc/array.c
> +++ linux-2.6.git/fs/proc/array.c
> @@ -547,3 +547,126 @@ int proc_pid_statm(struct seq_file *m, s
>
> return 0;
> }
> +
> +static struct pid *
> +get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
> +{
> + struct task_struct *start, *task;
> + struct pid *pid = NULL;
> +
> + read_lock(&tasklist_lock);
> +
> + start = pid_task(proc_pid(inode), PIDTYPE_PID);
> + if (!start)
> + goto out;
> +
> + /*
> + * Lets try to continue searching first, this gives
> + * us significant speedup on children-rich processes.
> + */
> + if (pid_prev) {
> + task = pid_task(pid_prev, PIDTYPE_PID);
> + if (task && task->real_parent == start &&
> + !(list_empty(&task->sibling))) {
> + if (list_is_last(&task->sibling, &start->children))
> + goto out;
> + task = list_first_entry(&task->sibling,
> + struct task_struct, sibling);
> + pid = get_pid(task_pid(task));
> + goto out;
> + }
> + }
> +
> + /*
> + * Slow search case.
> + *
> + * We might miss some children here if children
> + * are exited while we were not holding the lock,
> + * but it was never promised to be accurate that
> + * much.
> + *
> + * "Just suppose that the parent sleeps, but N children
> + * exit after we printed their tids. Now the slow paths
> + * skips N extra children, we miss N tasks." (c)
> + *
> + * So one need to stop or freeze the leader and all
> + * its children to get a precise result.
> + */
> + list_for_each_entry(task, &start->children, sibling) {
> + if (pos-- == 0) {
> + pid = get_pid(task_pid(task));
> + break;
> + }
> + }
> +
> +out:
> + read_unlock(&tasklist_lock);
> + return pid;
> +}
> +
> +static int children_seq_show(struct seq_file *seq, void *v)
> +{
> + struct inode *inode = seq->private;
> + pid_t pid;
> +
> + pid = pid_nr_ns(v, inode->i_sb->s_fs_info);
> + return seq_printf(seq, " %d", pid);
> +}
> +
> +static void *children_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> + return get_children_pid(seq->private, NULL, *pos);
> +}
> +
> +static void *children_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> + struct pid *pid = NULL;
> +
> + pid = get_children_pid(seq->private, v, *pos + 1);
> + if (!pid)
> + seq_printf(seq, "\n");
> + put_pid(v);
> +
> + ++*pos;
> + return pid;
> +}
> +
> +static void children_seq_stop(struct seq_file *seq, void *v)
> +{
> + put_pid(v);
> +}
> +
> +static const struct seq_operations children_seq_ops = {
> + .start = children_seq_start,
> + .next = children_seq_next,
> + .stop = children_seq_stop,
> + .show = children_seq_show,
> +};
> +
> +static int children_seq_open(struct inode *inode, struct file *file)
> +{
> + struct seq_file *m;
> + int ret;
> +
> + ret = seq_open(file, &children_seq_ops);
> + if (ret)
> + return ret;
> +
> + m = file->private_data;
> + m->private = inode;
> +
> + return ret;
> +}
> +
> +int children_seq_release(struct inode *inode, struct file *file)
> +{
> + seq_release(inode, file);
> + return 0;
> +}
> +
> +const struct file_operations proc_tid_children_operations = {
> + .open = children_seq_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = children_seq_release,
> +};
> Index: linux-2.6.git/fs/proc/base.c
> ===================================================================
> --- linux-2.6.git.orig/fs/proc/base.c
> +++ linux-2.6.git/fs/proc/base.c
> @@ -3454,6 +3454,7 @@ static const struct pid_entry tid_base_s
> ONE("stat", S_IRUGO, proc_tid_stat),
> ONE("statm", S_IRUGO, proc_pid_statm),
> REG("maps", S_IRUGO, proc_maps_operations),
> + REG("children", S_IRUGO, proc_tid_children_operations),
> #ifdef CONFIG_NUMA
> REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
> #endif
> Index: linux-2.6.git/fs/proc/internal.h
> ===================================================================
> --- linux-2.6.git.orig/fs/proc/internal.h
> +++ linux-2.6.git/fs/proc/internal.h
> @@ -53,6 +53,7 @@ extern int proc_pid_statm(struct seq_fil
> struct pid *pid, struct task_struct *task);
> extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
>
> +extern const struct file_operations proc_tid_children_operations;
> extern const struct file_operations proc_maps_operations;
> extern const struct file_operations proc_numa_maps_operations;
> extern const struct file_operations proc_smaps_operations;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists