linux-kernel - Re: [PATCH 1/3] will_become_orphaned

Message-ID: <m1zlwj2zj2.fsf@ebiederm.dsl.xmission.com>
Date:	Sun, 09 Dec 2007 16:56:17 -0700
From:	ebiederm@...ssion.com (Eric W. Biederman)
To:	Oleg Nesterov <oleg@...sign.ru>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Davide Libenzi <davidel@...ilserver.org>,
	Ingo Molnar <mingo@...e.hu>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Roland McGrath <roland@...hat.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/3] will_become_orphaned_pgrp: we have threads

Oleg Nesterov <oleg@...sign.ru> writes:

> On 12/09, Eric W. Biederman wrote:
>>
>> Equally messed up is our status in /proc at that point, which
>> says our sleeping process is a zombie.
>
> Yes, this is annoying.
>
>> I'm thinking we need to do at least some of the thread group leadership
>> transfer in do_exit, instead of de_thread.  Then p->group_leader->exit_state
>> would be sufficient to see if the entire thread group was alive,
>> as the group_leader would be whoever was left alive.  The original
>> group_leader might still need to be kept around for its pid...
>>
>> I think that would solve most of the problems you have with a dead
>> thread group leader and sending SIGSTOP as well.
>
> Yes, I was thinking about that too, but I am not brave enough to even
> try to think it through to the end ;)
>
> As a minimal change, I tried to add "task_struct *leader_proxy" to
> signal_struct, which points to the next live thread, and changed by
> exit_notify(). eligible_child() checks it instead of ->exit_signal.
> But this is so messy...
>
> And in fact, if we are talking about group stop, it is a group operation,
> why does do_wait() use per-thread ->exit_code and not ->group_exit_code?

Good question.  We would need a fallback for the case where it isn't a
group operation, like in exit, but that might clean something up.

> But yes, [PATCH 3/3] adds a visible difference, and I don't know if
> this difference is good or bad.
>
> 	$ sleep 1000
>
> 	[1]+  Stopped                 sleep 1000
> 	$ strace -p `pidof sleep`
> 	Process 432 attached - interrupt to quit
>
> Now strace "hangs" in do_wait() because ->exit_code was eaten by the
> shell. We need SIGCONT.
>
> With the "[PATCH 3/3]" strace proceeds happily.
>
> Oleg.

Well, I got to playing with the idea of actually moving group_leader,
and it turns out that while it is a pain it isn't actually that bad.
The worst part is not really changing the pid of the leader to the pid
of the entire thread group, as there are a few cases where we are
currently referencing task_pid where we really want task_tgid.
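
To make the leadership move concrete, here is a minimal userspace sketch of
the idea, not the kernel code.  The names model_task and
model_transfer_leadership are hypothetical, the exiting field only stands in
for PF_EXITING, and all locking is left out.  It assumes the threads of a
group form a circular list, as next_thread() does in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; this is not the kernel's task_struct. */
struct model_task {
	int tid;                          /* per-thread id */
	int exiting;                      /* stands in for PF_EXITING */
	struct model_task *group_leader;  /* current thread group leader */
	struct model_task *next;          /* circular list of group threads */
};

/*
 * Called when 'tsk' starts exiting.  If tsk is the leader, leadership
 * moves to the first thread on the list that is not exiting and every
 * thread is repointed at it.  Returns the leader afterwards, or NULL
 * when no live thread is left to take over (the patch waits for the
 * remaining threads in that case; the model just reports it).
 */
struct model_task *model_transfer_leadership(struct model_task *tsk)
{
	struct model_task *t, *new_leader = NULL;

	assert(tsk != NULL);
	tsk->exiting = 1;
	if (tsk->group_leader != tsk)
		return tsk->group_leader;

	for (t = tsk->next; t != tsk; t = t->next) {
		if (!t->exiting) {
			new_leader = t;
			break;
		}
	}
	if (!new_leader)
		return NULL;

	/* Repoint the old leader and all other threads at the new leader. */
	tsk->group_leader = new_leader;
	for (t = tsk->next; t != tsk; t = t->next)
		t->group_leader = new_leader;
	return new_leader;
}
```

With three threads where the second is already exiting, an exiting leader
hands leadership to the third, so group_leader always names a thread that
was alive at the time of the transfer.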

Oleg, below is my proof-of-concept patch, which really needs to be
broken up into a whole patch series so the changes are small
enough that we can do a thorough audit of them.  Anyway, take a look
and see what you think.

This patch does fix your weird test case without any actual
change to the do_wait logic itself.

The key idea is that by not making PIDTYPE_PID a hash chain we
can point two different struct pids at the same process, allowing
two different pids to return the same process from
pid_task(pid, PIDTYPE_PID).
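
As a rough illustration of that idea, here is a userspace sketch in which
each pid carries a single task pointer instead of a list.  The names
model_pid, model_attach_pid, model_redirect_pid and model_pid_task are
hypothetical and only loosely mirror the tsk and tid fields the patch adds.
RCU and reference counting are ignored.

```c
#include <assert.h>
#include <stddef.h>

struct model_task;

/* Hypothetical stand-in for struct pid: a number and one task pointer. */
struct model_pid {
	int nr;
	struct model_task *tsk;   /* replaces the PIDTYPE_PID hash chain */
};

struct model_task {
	struct model_pid *tid;    /* the pid this task reports as its own */
};

/* Make 'pid' the task's own pid and point the pid back at the task. */
void model_attach_pid(struct model_task *task, struct model_pid *pid)
{
	assert(task != NULL && pid != NULL);
	task->tid = pid;
	pid->tsk = task;
}

/* Redirect 'pid' to 'task' without touching the task's own pid. */
void model_redirect_pid(struct model_pid *pid, struct model_task *task)
{
	assert(pid != NULL);
	pid->tsk = task;
}

/* Resolve a pid to its task; NULL pids resolve to NULL. */
struct model_task *model_pid_task(const struct model_pid *pid)
{
	return pid ? pid->tsk : NULL;
}
```

After redirecting the group's pid to a surviving thread, both that pid and
the thread's own pid resolve to the same task, which a one-task-per-chain
lookup could not express.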

Which means most things continue to just work by working on
PIDTYPE_PID, although as mentioned previously there are a few
things, particularly do_notify_parent_cldstop and do_wait, that
need to be changed to return the tgid instead of the pid.
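
The choice between the two ids can be sketched as follows.  This is a
simplified model with hypothetical names (sketch_child, sketch_reported_id);
plain integers stand in for struct pid, and the extra
same_thread_group(current, p->parent) check that the patch applies in the
wait paths is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened view of a child; integers stand in for struct pid. */
struct sketch_child {
	int tid;      /* the thread's own id */
	int tgid;     /* the id of the whole thread group */
	int ptraced;  /* non-zero when the child is being traced */
};

/*
 * Id reported for this child in the sketch: a traced thread is
 * reported by its own id, everything else by its thread group id.
 */
int sketch_reported_id(const struct sketch_child *p)
{
	assert(p != NULL);
	return p->ptraced ? p->tid : p->tgid;
}
```

The point of the sketch is only that an ordinary parent keeps seeing the
tgid even when the task it is looking at no longer carries that number as
its own pid.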

Oh, and in eligible_child the PIDTYPE_PID test is now sneaky,
essentially doing a task lookup and seeing if the result
is the task we are checking, instead of comparing pids.
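
A small sketch of why the lookup form matters, again as a userspace model
with hypothetical names (demo_pid, demo_task, demo_match_by_pid,
demo_match_by_lookup) rather than the kernel's types.  It assumes two pids
may point at one task, as described above.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task;

/* Hypothetical stand-ins: a pid with one task pointer, a task with one pid. */
struct demo_pid {
	int nr;
	struct demo_task *tsk;
};

struct demo_task {
	struct demo_pid *tid;
};

/* Comparison form: does the task's own pid equal the requested pid? */
int demo_match_by_pid(const struct demo_task *p, const struct demo_pid *want)
{
	assert(p != NULL);
	return p->tid == want;
}

/* Lookup form: does the requested pid resolve to this task? */
int demo_match_by_lookup(const struct demo_task *p, const struct demo_pid *want)
{
	assert(p != NULL);
	return want != NULL && want->tsk == p;
}
```

When a surviving thread is reachable through both its own pid and the
group's pid, the lookup form matches either one, while the comparison form
only matches the thread's own pid.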

The funny part is that the pid shown by grep pid /proc/<tgid>/status
no longer always equals the tgid after the thread with that pid exits.
Still, that seems better than making the entire thread group look like
a zombie just because the wrong thread exited.

Subject: [PATCH] Allow thread group leaders to exit

---
 fs/exec.c                 |   81 ++---------------------
 fs/fcntl.c                |   20 ++++--
 fs/proc/base.c            |    6 +-
 include/linux/init_task.h |   25 ++++----
 include/linux/pid.h       |   14 ++---
 include/linux/sched.h     |   43 ++++++------
 kernel/exit.c             |  157 +++++++++++++++++++++++++++-----------------
 kernel/fork.c             |    2 +-
 kernel/itimer.c           |    2 +-
 kernel/pid.c              |   60 +++++++++++------
 kernel/signal.c           |   23 ++-----
 11 files changed, 204 insertions(+), 229 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 14a690d..1f69326 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -786,22 +786,6 @@ static int de_thread(struct task_struct *tsk)
 	 * Account for the thread group leader hanging around:
 	 */
 	count = 1;
-	if (!thread_group_leader(tsk)) {
-		count = 2;
-		/*
-		 * The SIGALRM timer survives the exec, but needs to point
-		 * at us as the new group leader now.  We have a race with
-		 * a timer firing now getting the old leader, so we need to
-		 * synchronize with any firing (by calling del_timer_sync)
-		 * before we can safely let the old group leader die.
-		 */
-		sig->tsk = tsk;
-		spin_unlock_irq(lock);
-		if (hrtimer_cancel(&sig->real_timer))
-			hrtimer_restart(&sig->real_timer);
-		spin_lock_irq(lock);
-	}
-
 	sig->notify_count = count;
 	while (atomic_read(&sig->count) > count) {
 		__set_current_state(TASK_UNINTERRUPTIBLE);
@@ -811,68 +795,15 @@ static int de_thread(struct task_struct *tsk)
 	}
 	spin_unlock_irq(lock);
 
-	/*
-	 * At this point all other threads have exited, all we have to
-	 * do is to wait for the thread group leader to become inactive,
-	 * and to assume its PID:
-	 */
-	if (!thread_group_leader(tsk)) {
-		leader = tsk->group_leader;
-
-		sig->notify_count = -1;
-		for (;;) {
-			write_lock_irq(&tasklist_lock);
-			if (likely(leader->exit_state))
-				break;
-			__set_current_state(TASK_UNINTERRUPTIBLE);
-			write_unlock_irq(&tasklist_lock);
-			schedule();
+	/* If it isn't already force gettid() == getpid() */
+	if (sig->tgid != tsk->tid) {
+		write_lock_irq(&tasklist_lock);
+		if (sig->tgid != tsk->tid) {
+			detach_pid(tsk, PIDTYPE_PID);
+			attach_pid(tsk, PIDTYPE_PID, sig->tgid);
 		}
-
-		/*
-		 * The only record we have of the real-time age of a
-		 * process, regardless of execs it's done, is start_time.
-		 * All the past CPU time is accumulated in signal_struct
-		 * from sister threads now dead.  But in this non-leader
-		 * exec, nothing survives from the original leader thread,
-		 * whose birth marks the true age of this process now.
-		 * When we take on its identity by switching to its PID, we
-		 * also take its birthdate (always earlier than our own).
-		 */
-		tsk->start_time = leader->start_time;
-
-		BUG_ON(!same_thread_group(leader, tsk));
-		BUG_ON(has_group_leader_pid(tsk));
-		/*
-		 * An exec() starts a new thread group with the
-		 * TGID of the previous thread group. Rehash the
-		 * two threads with a switched PID, and release
-		 * the former thread group leader:
-		 */
-
-		/* Become a process group leader with the old leader's pid.
-		 * The old leader becomes a thread of the this thread group.
-		 * Note: The old leader also uses this pid until release_task
-		 *       is called.  Odd but simple and correct.
-		 */
-		detach_pid(tsk, PIDTYPE_PID);
-		tsk->pid = leader->pid;
-		attach_pid(tsk, PIDTYPE_PID,  task_pid(leader));
-		transfer_pid(leader, tsk, PIDTYPE_PGID);
-		transfer_pid(leader, tsk, PIDTYPE_SID);
-		list_replace_rcu(&leader->tasks, &tsk->tasks);
-
-		tsk->group_leader = tsk;
-		leader->group_leader = tsk;
-
-		tsk->exit_signal = SIGCHLD;
-
-		BUG_ON(leader->exit_state != EXIT_ZOMBIE);
-		leader->exit_state = EXIT_DEAD;
-
 		write_unlock_irq(&tasklist_lock);
 	}
-
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
 
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 8685263..bc0a125 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -516,9 +516,13 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
 		goto out_unlock_fown;
 	
 	read_lock(&tasklist_lock);
-	do_each_pid_task(pid, type, p) {
-		send_sigio_to_task(p, fown, fd, band);
-	} while_each_pid_task(pid, type, p);
+	if (type == PIDTYPE_PID)
+		send_sigio_to_task(pid_task(pid, type), fown, fd, band);
+	else {
+		do_each_pid_task(pid, type, p) {
+			send_sigio_to_task(p, fown, fd, band);
+		} while_each_pid_task(pid, type, p);
+	}
 	read_unlock(&tasklist_lock);
  out_unlock_fown:
 	read_unlock(&fown->lock);
@@ -547,9 +551,13 @@ int send_sigurg(struct fown_struct *fown)
 	ret = 1;
 	
 	read_lock(&tasklist_lock);
-	do_each_pid_task(pid, type, p) {
-		send_sigurg_to_task(p, fown);
-	} while_each_pid_task(pid, type, p);
+	if (type == PIDTYPE_PID)
+		send_sigurg_to_task(pid_task(pid, type), fown);
+	else {
+		do_each_pid_task(pid, type, p) {
+			send_sigurg_to_task(p, fown);
+		} while_each_pid_task(pid, type, p);
+	}
 	read_unlock(&tasklist_lock);
  out_unlock_fown:
 	read_unlock(&fown->lock);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d59708e..f7bd620 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2438,15 +2438,15 @@ retry:
 		 * pid of a thread_group_leader.  Testing for task
 		 * being a thread_group_leader is the obvious thing
 		 * todo but there is a window when it fails, due to
-		 * the pid transfer logic in de_thread.
+		 * the pid transfer logic at group leader death.
 		 *
 		 * So we perform the straight forward test of seeing
-		 * if the pid we have found is the pid of a thread
+		 * if the pid we have found is the pid of the thread
 		 * group leader, and don't worry if the task we have
 		 * found doesn't happen to be a thread group leader.
 		 * As we don't care in the case of readdir.
 		 */
-		if (!iter.task || !has_group_leader_pid(iter.task)) {
+		if (!iter.task || pid != task_tgid(iter.task)) {
 			iter.tgid += 1;
 			goto retry;
 		}
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 96be7d6..ddcd7c1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -67,6 +67,9 @@
 	.posix_timers	 = LIST_HEAD_INIT(sig.posix_timers),		\
 	.cpu_timers	= INIT_CPU_TIMERS(sig.cpu_timers),		\
 	.rlim		= INIT_RLIMITS,					\
+	.tgid		= &init_struct_pid,				\
+	.pids[PIDTYPE_PGID]	= &init_struct_pid,			\
+	.pids[PIDTYPE_SID]	= &init_struct_pid,			\
 }
 
 extern struct nsproxy init_nsproxy;
@@ -91,10 +94,10 @@ extern struct group_info init_groups;
 
 #define INIT_STRUCT_PID {						\
 	.count 		= ATOMIC_INIT(1),				\
+	.tsk		= &init_task,					\
 	.tasks		= {						\
-		{ .first = &init_task.pids[PIDTYPE_PID].node },		\
-		{ .first = &init_task.pids[PIDTYPE_PGID].node },	\
-		{ .first = &init_task.pids[PIDTYPE_SID].node },		\
+		{ .first = &init_task.pids[PIDTYPE_PGID] },		\
+		{ .first = &init_task.pids[PIDTYPE_SID] },		\
 	},								\
 	.rcu		= RCU_HEAD_INIT,				\
 	.level		= 0,						\
@@ -105,13 +108,10 @@ extern struct group_info init_groups;
 	}, }								\
 }
 
-#define INIT_PID_LINK(type) 					\
-{								\
-	.node = {						\
-		.next = NULL,					\
-		.pprev = &init_struct_pid.tasks[type].first,	\
-	},							\
-	.pid = &init_struct_pid,				\
+#define INIT_PID_HLIST_NODE(type) 			\
+{							\
+	.next = NULL,					\
+	.pprev = &init_struct_pid.tasks[type].first,	\
 }
 
 #ifdef CONFIG_SECURITY_FILE_CAPABILITIES
@@ -179,9 +179,8 @@ extern struct group_info init_groups;
 	.fs_excl	= ATOMIC_INIT(0),				\
 	.pi_lock	= __SPIN_LOCK_UNLOCKED(tsk.pi_lock),		\
 	.pids = {							\
-		[PIDTYPE_PID]  = INIT_PID_LINK(PIDTYPE_PID),		\
-		[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),		\
-		[PIDTYPE_SID]  = INIT_PID_LINK(PIDTYPE_SID),		\
+		[PIDTYPE_PGID] = INIT_PID_HLIST_NODE(PIDTYPE_PGID),	\
+		[PIDTYPE_SID]  = INIT_PID_HLIST_NODE(PIDTYPE_SID),	\
 	},								\
 	.dirties = INIT_PROP_LOCAL_SINGLE(dirties),			\
 	INIT_TRACE_IRQFLAGS						\
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 061abb6..828355e 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -5,9 +5,10 @@
 
 enum pid_type
 {
-	PIDTYPE_PID,
 	PIDTYPE_PGID,
 	PIDTYPE_SID,
+	PIDTYPE_PID,
+#define PIDTYPE_ARRAY_MAX PIDTYPE_PID
 	PIDTYPE_MAX
 };
 
@@ -58,7 +59,8 @@ struct pid
 {
 	atomic_t count;
 	/* lists of tasks that use this pid */
-	struct hlist_head tasks[PIDTYPE_MAX];
+	struct task_struct *tsk;
+	struct hlist_head tasks[PIDTYPE_ARRAY_MAX];
 	struct rcu_head rcu;
 	int level;
 	struct upid numbers[1];
@@ -66,12 +68,6 @@ struct pid
 
 extern struct pid init_struct_pid;
 
-struct pid_link
-{
-	struct hlist_node node;
-	struct pid *pid;
-};
-
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
@@ -158,7 +154,7 @@ static inline pid_t pid_vnr(struct pid *pid)
 		struct hlist_node *pos___;				\
 		if (pid != NULL)					\
 			hlist_for_each_entry_rcu((task), pos___,	\
-				&pid->tasks[type], pids[type].node) {
+				&pid->tasks[type], pids[type]) {
 
 #define while_each_pid_task(pid, type, task)				\
 			}						\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b1e25b..496dfda 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -453,7 +453,6 @@ struct signal_struct {
 
 	/* ITIMER_REAL timer for the process */
 	struct hrtimer real_timer;
-	struct task_struct *tsk;
 	ktime_t it_real_incr;
 
 	/* ITIMER_PROF and ITIMER_VIRTUAL timers for the process */
@@ -461,6 +460,8 @@ struct signal_struct {
 	cputime_t it_prof_incr, it_virt_incr;
 
 	/* job control IDs */
+	struct pid *tgid;
+	struct pid *pids[PIDTYPE_ARRAY_MAX];
 
 	/*
 	 * pgrp and session fields are deprecated.
@@ -1034,8 +1035,9 @@ struct task_struct {
 	struct list_head sibling;	/* linkage in my parent's children list */
 	struct task_struct *group_leader;	/* threadgroup leader */
 
+	struct pid *tid;
 	/* PID/PID hash table linkage. */
-	struct pid_link pids[PIDTYPE_MAX];
+	struct hlist_node pids[PIDTYPE_ARRAY_MAX];
 	struct list_head thread_group;
 
 	struct completion *vfork_done;		/* for vfork() */
@@ -1261,22 +1263,34 @@ static inline void set_task_pgrp(struct task_struct *tsk, pid_t pgrp)
 
 static inline struct pid *task_pid(struct task_struct *task)
 {
-	return task->pids[PIDTYPE_PID].pid;
+	return task->tid;
 }
 
 static inline struct pid *task_tgid(struct task_struct *task)
 {
-	return task->group_leader->pids[PIDTYPE_PID].pid;
+	struct signal_struct *sig = rcu_dereference(task->signal);
+	struct pid *pid = NULL;
+	if (sig)
+		pid = sig->tgid;
+	return pid;
 }
 
 static inline struct pid *task_pgrp(struct task_struct *task)
 {
-	return task->group_leader->pids[PIDTYPE_PGID].pid;
+	struct signal_struct *sig = rcu_dereference(task->signal);
+	struct pid *pid = NULL;
+	if (sig)
+		pid = sig->pids[PIDTYPE_PGID];
+	return pid;
 }
 
 static inline struct pid *task_session(struct task_struct *task)
 {
-	return task->group_leader->pids[PIDTYPE_SID].pid;
+	struct signal_struct *sig = rcu_dereference(task->signal);
+	struct pid *pid = NULL;
+	if (sig)
+		pid = sig->pids[PIDTYPE_SID];
+	return pid;
 }
 
 struct pid_namespace;
@@ -1371,7 +1385,7 @@ static inline pid_t task_ppid_nr_ns(struct task_struct *tsk,
  */
 static inline int pid_alive(struct task_struct *p)
 {
-	return p->pids[PIDTYPE_PID].pid != NULL;
+	return p->signal != NULL;
 }
 
 /**
@@ -1652,7 +1666,6 @@ extern void block_all_signals(int (*notifier)(void *priv), void *priv,
 extern void unblock_all_signals(void);
 extern void release_task(struct task_struct * p);
 extern int send_sig_info(int, struct siginfo *, struct task_struct *);
-extern int send_group_sig_info(int, struct siginfo *, struct task_struct *);
 extern int force_sigsegv(int, struct task_struct *);
 extern int force_sig_info(int, struct siginfo *, struct task_struct *);
 extern int __kill_pgrp_info(int sig, struct siginfo *info, struct pid *pgrp);
@@ -1772,17 +1785,6 @@ extern void wait_task_inactive(struct task_struct * p);
 /* de_thread depends on thread_group_leader not being a pid based check */
 #define thread_group_leader(p)	(p == p->group_leader)
 
-/* Do to the insanities of de_thread it is possible for a process
- * to have the pid of the thread group leader without actually being
- * the thread group leader.  For iteration through the pids in proc
- * all we care about is that we have a task with the appropriate
- * pid, we don't actually care if we have the right task.
- */
-static inline int has_group_leader_pid(struct task_struct *p)
-{
-	return p->pid == p->tgid;
-}
-
 static inline
 int same_thread_group(struct task_struct *p1, struct task_struct *p2)
 {
@@ -1800,9 +1802,6 @@ static inline int thread_group_empty(struct task_struct *p)
 	return list_empty(&p->thread_group);
 }
 
-#define delay_group_leader(p) \
-		(thread_group_leader(p) && !thread_group_empty(p))
-
 /*
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
diff --git a/kernel/exit.c b/kernel/exit.c
index 1ab19f0..94552e0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -57,7 +57,6 @@ static void exit_mm(struct task_struct * tsk);
 static void __unhash_process(struct task_struct *p)
 {
 	nr_threads--;
-	detach_pid(p, PIDTYPE_PID);
 	if (thread_group_leader(p)) {
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);
@@ -65,6 +64,7 @@ static void __unhash_process(struct task_struct *p)
 		list_del_rcu(&p->tasks);
 		__get_cpu_var(process_counts)--;
 	}
+	detach_pid(p, PIDTYPE_PID);
 	list_del_rcu(&p->thread_group);
 	remove_parent(p);
 }
@@ -144,44 +144,15 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 
 void release_task(struct task_struct * p)
 {
-	struct task_struct *leader;
-	int zap_leader;
-repeat:
 	atomic_dec(&p->user->processes);
 	proc_flush_task(p);
 	write_lock_irq(&tasklist_lock);
 	ptrace_unlink(p);
 	BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children));
 	__exit_signal(p);
-
-	/*
-	 * If we are the last non-leader member of the thread
-	 * group, and the leader is zombie, then notify the
-	 * group leader's parent process. (if it wants notification.)
-	 */
-	zap_leader = 0;
-	leader = p->group_leader;
-	if (leader != p && thread_group_empty(leader) && leader->exit_state == EXIT_ZOMBIE) {
-		BUG_ON(leader->exit_signal == -1);
-		do_notify_parent(leader, leader->exit_signal);
-		/*
-		 * If we were the last child thread and the leader has
-		 * exited already, and the leader's parent ignores SIGCHLD,
-		 * then we are the one who should release the leader.
-		 *
-		 * do_notify_parent() will have marked it self-reaping in
-		 * that case.
-		 */
-		zap_leader = (leader->exit_signal == -1);
-	}
-
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
 	call_rcu(&p->rcu, delayed_put_task_struct);
-
-	p = leader;
-	if (unlikely(zap_leader))
-		goto repeat;
 }
 
 /*
@@ -633,8 +604,7 @@ reparent_thread(struct task_struct *p, struct task_struct *father, int traced)
 	/* If we'd notified the old parent about this child's death,
 	 * also notify the new parent.
 	 */
-	if (!traced && p->exit_state == EXIT_ZOMBIE &&
-	    p->exit_signal != -1 && thread_group_empty(p))
+	if (!traced && p->exit_state == EXIT_ZOMBIE && p->exit_signal != -1)
 		do_notify_parent(p, p->exit_signal);
 
 	/*
@@ -702,8 +672,7 @@ static void forget_original_parent(struct task_struct *father)
 		} else {
 			/* reparent ptraced task to its real parent */
 			__ptrace_unlink (p);
-			if (p->exit_state == EXIT_ZOMBIE && p->exit_signal != -1 &&
-			    thread_group_empty(p))
+			if (p->exit_state == EXIT_ZOMBIE && p->exit_signal != -1)
 				do_notify_parent(p, p->exit_signal);
 		}
 
@@ -773,6 +742,11 @@ static void exit_notify(struct task_struct *tsk)
 	exit_task_namespaces(tsk);
 
 	write_lock_irq(&tasklist_lock);
+	/* If we haven't yet made gettid() == getpid() do so now */
+	if (thread_group_leader(tsk) && (tsk->tid != tsk->signal->tgid)) {
+		detach_pid(tsk, PIDTYPE_PID);
+		attach_pid(tsk, PIDTYPE_PID, tsk->signal->tgid);
+	}
 	/*
 	 * Check to see if any process groups have become orphaned
 	 * as a result of our exiting, and if they have any stopped
@@ -818,7 +792,7 @@ static void exit_notify(struct task_struct *tsk)
 	 * send it a SIGCHLD instead of honoring exit_signal.  exit_signal
 	 * only has special meaning to our real parent.
 	 */
-	if (tsk->exit_signal != -1 && thread_group_empty(tsk)) {
+	if (tsk->exit_signal != -1) {
 		int signal = tsk->parent == tsk->real_parent ? tsk->exit_signal : SIGCHLD;
 		do_notify_parent(tsk, signal);
 	} else if (tsk->ptrace) {
@@ -946,6 +920,48 @@ fastcall NORET_TYPE void do_exit(long code)
 	}
 
 	tsk->flags |= PF_EXITING;
+	/* Transfer thread group leadership */
+	if (thread_group_leader(tsk) && !thread_group_empty(tsk)) {
+		struct task_struct *new_leader, *t;
+		write_lock_irq(&tasklist_lock);
+		for (t = next_thread(tsk); t != tsk; t = next_thread(t)) {
+			if (!(t->flags & PF_EXITING))
+				break;
+		}
+		if (t != tsk) {
+			new_leader = t;
+		
+			new_leader->start_time = tsk->start_time;
+			task_pid(tsk)->tsk = new_leader;
+			transfer_pid(tsk, new_leader, PIDTYPE_PGID);
+			transfer_pid(tsk, new_leader, PIDTYPE_SID);
+			list_replace_rcu(&tsk->tasks, &new_leader->tasks);
+
+			/* Update group_leader on all of the threads... */
+			new_leader->group_leader = new_leader;
+			tsk->group_leader = new_leader;
+			for (t = next_thread(tsk); t != tsk; t= next_thread(t)) {
+				t->group_leader = new_leader;
+			}
+
+			new_leader->exit_signal = tsk->exit_signal;
+			tsk->exit_signal = -1;
+
+			write_unlock_irq(&tasklist_lock);
+		} else {
+			write_unlock_irq(&tasklist_lock);
+			/* Wait for the other threads to exit before continuing */
+			for (;;) {
+				read_lock(&tasklist_lock);
+				if (thread_group_empty(tsk))
+					break;
+				__set_current_state(TASK_UNINTERRUPTIBLE);
+				read_unlock(&tasklist_lock);
+				schedule();
+			}
+			read_unlock(&tasklist_lock);
+		}
+	}
 	/*
 	 * tsk->flags are checked in the futex code to protect against
 	 * an exiting task cleaning up the robust pi futexes.
@@ -1106,20 +1122,18 @@ asmlinkage void sys_exit_group(int error_code)
 	do_group_exit((error_code & 0xff) << 8);
 }
 
-static int eligible_child(pid_t pid, int options, struct task_struct *p)
+static int eligible_child(enum pid_type type, struct pid *pid, int options, struct task_struct *p)
 {
 	int err;
-	struct pid_namespace *ns;
 
-	ns = current->nsproxy->pid_ns;
-	if (pid > 0) {
-		if (task_pid_nr_ns(p, ns) != pid)
+	if (type == PIDTYPE_PID) {
+		/* Match all pids pointing at task p */
+		if (pid_task(pid, PIDTYPE_PID) != p)
 			return 0;
-	} else if (!pid) {
-		if (task_pgrp_nr_ns(p, ns) != task_pgrp_vnr(current))
-			return 0;
-	} else if (pid != -1) {
-		if (task_pgrp_nr_ns(p, ns) != -pid)
+	} else if (type < PIDTYPE_MAX) {
+		struct signal_struct *sig;
+		sig = rcu_dereference(p->signal);
+		if (sig && (sig->pids[type] != pid))
 			return 0;
 	}
 
@@ -1346,7 +1360,8 @@ static int wait_task_stopped(struct task_struct *p,
 {
 	int retval, exit_code, why;
 	uid_t uid = 0; /* unneeded, required by compiler */
-	pid_t pid;
+	struct pid *pid;
+	pid_t upid;
 
 	exit_code = 0;
 	spin_lock_irq(&p->sighand->siglock);
@@ -1382,12 +1397,16 @@ unlock_sig:
 	 * possibly take page faults for user memory.
 	 */
 	get_task_struct(p);
-	pid = task_pid_nr_ns(p, current->nsproxy->pid_ns);
+	if (p->ptrace && same_thread_group(current, p->parent))
+		pid = task_pid(p);
+	else
+		pid = task_tgid(p);
+	upid = pid_nr_ns(pid, current->nsproxy->pid_ns);
 	why = (p->ptrace & PT_PTRACED) ? CLD_TRAPPED : CLD_STOPPED;
 	read_unlock(&tasklist_lock);
 
 	if (unlikely(noreap))
-		return wait_noreap_copyout(p, pid, uid,
+		return wait_noreap_copyout(p, upid, uid,
 					   why, exit_code,
 					   infop, ru);
 
@@ -1403,11 +1422,11 @@ unlock_sig:
 	if (!retval && infop)
 		retval = put_user(exit_code, &infop->si_status);
 	if (!retval && infop)
-		retval = put_user(pid, &infop->si_pid);
+		retval = put_user(upid, &infop->si_pid);
 	if (!retval && infop)
 		retval = put_user(uid, &infop->si_uid);
 	if (!retval)
-		retval = pid;
+		retval = upid;
 	put_task_struct(p);
 
 	BUG_ON(!retval);
@@ -1425,7 +1444,8 @@ static int wait_task_continued(struct task_struct *p, int noreap,
 			       int __user *stat_addr, struct rusage __user *ru)
 {
 	int retval;
-	pid_t pid;
+	struct pid *pid;
+	pid_t upid;
 	uid_t uid;
 
 	if (!(p->signal->flags & SIGNAL_STOP_CONTINUED))
@@ -1440,8 +1460,11 @@ static int wait_task_continued(struct task_struct *p, int noreap,
 	if (!noreap)
 		p->signal->flags &= ~SIGNAL_STOP_CONTINUED;
 	spin_unlock_irq(&p->sighand->siglock);
-
-	pid = task_pid_nr_ns(p, current->nsproxy->pid_ns);
+	if (p->ptrace && same_thread_group(current, p->parent))
+		pid = task_pid(p);
+	else
+		pid = task_tgid(p);
+	upid = pid_nr_ns(pid, current->nsproxy->pid_ns);
 	uid = p->uid;
 	get_task_struct(p);
 	read_unlock(&tasklist_lock);
@@ -1452,9 +1475,9 @@ static int wait_task_continued(struct task_struct *p, int noreap,
 		if (!retval && stat_addr)
 			retval = put_user(0xffff, stat_addr);
 		if (!retval)
-			retval = pid;
+			retval = upid;
 	} else {
-		retval = wait_noreap_copyout(p, pid, uid,
+		retval = wait_noreap_copyout(p, upid, uid,
 					     CLD_CONTINUED, SIGCONT,
 					     infop, ru);
 		BUG_ON(retval == 0);
@@ -1463,13 +1486,25 @@ static int wait_task_continued(struct task_struct *p, int noreap,
 	return retval;
 }
 
-static long do_wait(pid_t pid, int options, struct siginfo __user *infop,
+static long do_wait(pid_t upid, int options, struct siginfo __user *infop,
 		    int __user *stat_addr, struct rusage __user *ru)
 {
 	DECLARE_WAITQUEUE(wait, current);
 	struct task_struct *tsk;
 	int flag, retval;
-
+	struct pid *pid = NULL;
+	enum pid_type type = PIDTYPE_MAX;
+
+	if (upid > 0) {
+		type = PIDTYPE_PID;
+		pid = find_get_pid(upid);
+	} else if (upid == 0) {
+		type = PIDTYPE_PGID;
+		pid = get_pid(task_pgrp(current));
+	} else if (upid < -1) {
+		type = PIDTYPE_PGID;
+		pid = find_get_pid(-upid);
+	}
 	add_wait_queue(&current->signal->wait_chldexit,&wait);
 repeat:
 	/*
@@ -1484,7 +1519,7 @@ repeat:
 		struct task_struct *p;
 
 		list_for_each_entry(p, &tsk->children, sibling) {
-			int ret = eligible_child(pid, options, p);
+			int ret = eligible_child(type, pid, options, p);
 			if (!ret)
 				continue;
 
@@ -1503,8 +1538,7 @@ repeat:
 				retval = wait_task_stopped(p,
 						(options & WNOWAIT), infop,
 						stat_addr, ru);
-			} else if (p->exit_state == EXIT_ZOMBIE &&
-					!delay_group_leader(p)) {
+			} else if (p->exit_state == EXIT_ZOMBIE) {
 				/*
 				 * We don't reap group leaders with subthreads.
 				 */
@@ -1531,7 +1565,7 @@ repeat:
 		if (!flag) {
 			list_for_each_entry(p, &tsk->ptrace_children,
 								ptrace_list) {
-				flag = eligible_child(pid, options, p);
+				flag = eligible_child(type, pid, options, p);
 				if (!flag)
 					continue;
 				if (likely(flag > 0))
@@ -1560,6 +1594,7 @@ repeat:
 end:
 	current->state = TASK_RUNNING;
 	remove_wait_queue(&current->signal->wait_chldexit,&wait);
+	put_pid(pid);
 	if (infop) {
 		if (retval > 0)
 			retval = 0;
diff --git a/kernel/fork.c b/kernel/fork.c
index 7abb592..e986be9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -883,7 +883,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	sig->it_real_incr.tv64 = 0;
 	sig->real_timer.function = it_real_fn;
-	sig->tsk = tsk;
 
 	sig->it_virt_expires = cputime_zero;
 	sig->it_virt_incr = cputime_zero;
@@ -1308,6 +1307,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 			if (clone_flags & CLONE_NEWPID)
 				p->nsproxy->pid_ns->child_reaper = p;
 
+			p->signal->tgid = pid;
 			p->signal->tty = current->signal->tty;
 			set_task_pgrp(p, task_pgrp_nr(current));
 			set_task_session(p, task_session_nr(current));
diff --git a/kernel/itimer.c b/kernel/itimer.c
index 2fab344..f40b589 100644
--- a/kernel/itimer.c
+++ b/kernel/itimer.c
@@ -132,7 +132,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer)
 	struct signal_struct *sig =
 		container_of(timer, struct signal_struct, real_timer);
 
-	send_group_sig_info(SIGALRM, SEND_SIG_PRIV, sig->tsk);
+	kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->tgid);
 
 	return HRTIMER_NORESTART;
 }
diff --git a/kernel/pid.c b/kernel/pid.c
index 21f027c..b45b53d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -319,28 +319,39 @@ EXPORT_SYMBOL_GPL(find_pid);
 int fastcall attach_pid(struct task_struct *task, enum pid_type type,
 		struct pid *pid)
 {
-	struct pid_link *link;
-
-	link = &task->pids[type];
-	link->pid = pid;
-	hlist_add_head_rcu(&link->node, &pid->tasks[type]);
+	if (type == PIDTYPE_PID) {
+		task->tid = pid;
+		pid->tsk = task;
+	}
+	else {
+		task->signal->pids[type] = pid;
+		hlist_add_head_rcu(&task->pids[type], &pid->tasks[type]);
+	}
 
 	return 0;
 }
 
 void fastcall detach_pid(struct task_struct *task, enum pid_type type)
 {
-	struct pid_link *link;
-	struct pid *pid;
+	struct pid **ppid, *pid;
 	int tmp;
 
-	link = &task->pids[type];
-	pid = link->pid;
-
-	hlist_del_rcu(&link->node);
-	link->pid = NULL;
+	if (type == PIDTYPE_PID) {
+		ppid = &task->tid;
+		pid = *ppid;
+		if (pid->tsk == task)
+			pid->tsk = NULL;
+	}
+	else {
+		hlist_del_rcu(&task->pids[type]);
+		ppid = &task->signal->pids[type];
+	}
+	pid = *ppid;
+	*ppid = NULL;
 
-	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
+	if (pid->tsk)
+		return;
+	for (tmp = PIDTYPE_MAX -1; --tmp >= 0; )
 		if (!hlist_empty(&pid->tasks[tmp]))
 			return;
 
@@ -351,19 +362,22 @@ void fastcall detach_pid(struct task_struct *task, enum pid_type type)
 void fastcall transfer_pid(struct task_struct *old, struct task_struct *new,
 			   enum pid_type type)
 {
-	new->pids[type].pid = old->pids[type].pid;
-	hlist_replace_rcu(&old->pids[type].node, &new->pids[type].node);
-	old->pids[type].pid = NULL;
+	hlist_replace_rcu(&old->pids[type], &new->pids[type]);
 }
 
 struct task_struct * fastcall pid_task(struct pid *pid, enum pid_type type)
 {
 	struct task_struct *result = NULL;
 	if (pid) {
-		struct hlist_node *first;
-		first = rcu_dereference(pid->tasks[type].first);
-		if (first)
-			result = hlist_entry(first, struct task_struct, pids[(type)].node);
+		if (type == PIDTYPE_PID)
+			result = rcu_dereference(pid->tsk);
+		else {
+			struct hlist_node *first;
+			first = rcu_dereference(pid->tasks[type].first);
+			if (first)
+				result = hlist_entry(first, struct task_struct,
+						     pids[(type)]);
+		}
 	}
 	return result;
 }
@@ -402,7 +416,11 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
 {
 	struct pid *pid;
 	rcu_read_lock();
-	pid = get_pid(task->pids[type].pid);
+	if (type == PIDTYPE_PID)
+		pid = task->tid;
+	else
+		pid = task->signal->pids[type];
+	get_pid(pid);
 	rcu_read_unlock();
 	return pid;
 }
diff --git a/kernel/signal.c b/kernel/signal.c
index 06e663d..af8c49f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1195,20 +1195,6 @@ send_sig(int sig, struct task_struct *p, int priv)
 	return send_sig_info(sig, __si_special(priv), p);
 }
 
-/*
- * This is the entry point for "process-wide" signals.
- * They will go to an appropriate thread in the thread group.
- */
-int
-send_group_sig_info(int sig, struct siginfo *info, struct task_struct *p)
-{
-	int ret;
-	read_lock(&tasklist_lock);
-	ret = group_send_sig_info(sig, info, p);
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 void
 force_sig(int sig, struct task_struct *p)
 {
@@ -1501,12 +1487,15 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
 	unsigned long flags;
 	struct task_struct *parent;
 	struct sighand_struct *sighand;
+	struct pid *pid;
 
-	if (tsk->ptrace & PT_PTRACED)
+	if (tsk->ptrace & PT_PTRACED) {
 		parent = tsk->parent;
-	else {
+		pid = task_pid(tsk);
+	} else {
 		tsk = tsk->group_leader;
 		parent = tsk->real_parent;
+		pid = task_tgid(tsk);
 	}
 
 	info.si_signo = SIGCHLD;
@@ -1515,7 +1504,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
 	 * see comment in do_notify_parent() abot the following 3 lines
 	 */
 	rcu_read_lock();
-	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
+	info.si_pid = pid_nr_ns(pid, parent->nsproxy->pid_ns);
 	rcu_read_unlock();
 
 	info.si_uid = tsk->uid;
-- 
1.5.3.rc6.17.g1911
