linux-kernel - Re: [PATCH] pid_ns: Fix race between setns'ed fork() and zap_pid_ns


Message-ID: <b5d58321-d13c-8cb6-a66a-f3f86a48b16c@virtuozzo.com>
Date:   Fri, 12 May 2017 18:17:56 +0300
From:   Kirill Tkhai <ktkhai@...tuozzo.com>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
CC:     <mhocko@...e.com>, <avagin@...nvz.org>, <peterz@...radead.org>,
        <oleg@...hat.com>, <linux-kernel@...r.kernel.org>,
        <rppt@...ux.vnet.ibm.com>, <luto@...nel.org>,
        <gorcunov@...nvz.org>, <akpm@...ux-foundation.org>,
        <mingo@...nel.org>, <serge@...lyn.com>
Subject: Re: [PATCH] pid_ns: Fix race between setns'ed fork() and
 zap_pid_ns_processes()



On 12.05.2017 17:49, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@...tuozzo.com> writes:
> 
>> On 12.05.2017 17:26, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@...tuozzo.com> writes:
>>>
>>>> Imagine we have a pid namespace and a task from its parent's pid_ns
>>>> which has done setns() into the pid namespace. The task is doing fork()
>>>> while the pid namespace's child reaper is dying. There is a race
>>>> between them:
>>>>
>>>> Task from parent pid_ns             Child reaper
>>>> copy_process()                      ..
>>>>   alloc_pid()                       ..
>>>>   ..                                zap_pid_ns_processes()
>>>>   ..                                  disable_pid_allocation()
>>>>   ..                                  read_lock(&tasklist_lock)
>>>>   ..                                  iterate over pids in pid_ns
>>>>   ..                                    kill tasks linked to pids
>>>>   ..                                  read_unlock(&tasklist_lock)
>>>>   write_lock_irq(&tasklist_lock);   ..
>>>>   attach_pid(p, PIDTYPE_PID);       ..
>>>>   ..                                ..
>>>>
>>>> So the just-created task p won't receive the SIGKILL signal,
>>>> and the pid namespace will be left in an inconsistent state.
>>>> Only a manual kill will help there, but does userspace
>>>> care about this? I suppose most users just inject
>>>> a task into a pid namespace and wait for a SIGCHLD from it.
>>>>
>>>> The patch fixes the problem. It moves disable_pid_allocation()
>>>> into find_child_reaper(), where tasklist_lock is held,
>>>> which allows a simple check of (pid_ns->nr_hashed & PIDNS_HASH_ADDING)
>>>> in copy_process(). If allocation is disabled, we just
>>>> return -ENOMEM, as alloc_pid() does in such cases.
>>>
>>> This problem sounds very theoretical; has it ever come up in practice?
>>> I am asking to see if this is something we will care enough about to
>>> backport.
>>
>> I haven't seen this in practice. I think we may apply the same policy
>> that is used for Coverity reports, though this isn't one.
>>
>>> Please look at what happens when you call
>>> spin_unlock_irq(&pidmap_lock) under write_lock_irq(&tasklist_lock);
>>
>> Ah, missed that, thanks.
>>  
>>> Please also look at what happens when pid == &init_pid but
>>> p->nsproxy->pid_ns_for_children happens to have PIDNS_HASH_ADDING
>>> set.
> 
> Apologies I meant PIDNS_HASH_ADDING clear.
> 
>> init pid refers to init_pid_ns, which has PIDNS_HASH_ADDING set. So,
>> there shouldn't be a problem.
>>
>> Could you explain what you mean?
> 
> I mean locally in copy_process your code is not correct.
> Instead of caching pid_ns you want to use ns_of_pid(pid) so that
> if pid == &init_pid you don't care what strange things are going on
> in the calling process.

Hm. If pid is init_struct_pid, then we're forking INIT_TASK.
p = dup_task_struct(current, node), so p->nsproxy == INIT_TASK->nsproxy,
i.e. init_nsproxy. Its pid_ns_for_children refers to init_pid_ns. There is
no problem; it's just a code simplification.

But if that seems unclear to you, I could do something like the patch below.
What do you think about that?
---
diff --git a/kernel/fork.c b/kernel/fork.c
index bfd91b180778..e9835693d299 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1523,6 +1523,7 @@ static __latent_entropy struct task_struct *copy_process(
 					unsigned long tls,
 					int node)
 {
+	struct pid_namespace *pid_ns = NULL;
 	int retval;
 	struct task_struct *p;
 
@@ -1736,7 +1737,8 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+		pid_ns = p->nsproxy->pid_ns_for_children;
+		pid = alloc_pid(pid_ns);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_thread;
@@ -1845,10 +1847,11 @@ static __latent_entropy struct task_struct *copy_process(
 	*/
 	recalc_sigpending();
 	if (signal_pending(current)) {
-		spin_unlock(&current->sighand->siglock);
-		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
-		goto bad_fork_cancel_cgroup;
+		goto bad_fork_unlock_siglock;
+	} else if (unlikely(pid_ns && !(pid_ns->nr_hashed & PIDNS_HASH_ADDING))) {
+		retval = -ENOMEM;
+		goto bad_fork_unlock_siglock;
 	}
 
 	if (likely(p->pid)) {
@@ -1906,7 +1909,9 @@ static __latent_entropy struct task_struct *copy_process(
 
 	return p;
 
-bad_fork_cancel_cgroup:
+bad_fork_unlock_siglock:
+	spin_unlock(&current->sighand->siglock);
+	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p);
 bad_fork_free_pid:
 	cgroup_threadgroup_change_end(current);
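
For readers following the thread, here is a minimal userspace sketch of the
check the patch above adds to copy_process(). It is a toy model, not kernel
code: toy_pid_ns, toy_reaper_exit(), toy_attach() and toy_survivors() are
hypothetical names, and the assumption is that both sides run under one lock
(tasklist_lock in the kernel), so they are modelled as plain sequential calls.
A NULL namespace pointer stands for the init_struct_pid case, where no pid was
allocated and the check is skipped.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the "allocation still allowed" flag bit (assumption). */
#define TOY_HASH_ADDING (1U << 31)

/* Hypothetical, heavily simplified model of a pid namespace. */
struct toy_pid_ns {
    unsigned int nr_hashed; /* attached-task count plus the flag bit */
    int attached;           /* tasks attached so far */
    int killed;             /* tasks the reaper's kill pass has seen */
};

/*
 * Reaper side. Assumed to run with the lock held: it clears the flag and
 * then "kills" every task attached up to this point.
 */
static void toy_reaper_exit(struct toy_pid_ns *ns)
{
    ns->nr_hashed &= ~TOY_HASH_ADDING;
    ns->killed = ns->attached;
}

/*
 * Fork side, the part done under the same lock. With check_flag set it
 * refuses to attach once the reaper has cleared the flag; with it clear
 * it behaves like the code before the patch.
 */
static int toy_attach(struct toy_pid_ns *ns, bool check_flag)
{
    if (ns == NULL)
        return 0; /* no pid was allocated, so there is nothing to check */
    if (check_flag && !(ns->nr_hashed & TOY_HASH_ADDING))
        return -ENOMEM;
    ns->attached++;
    ns->nr_hashed++;
    return 0;
}

/* Tasks that were attached but never seen by the kill pass. */
static int toy_survivors(const struct toy_pid_ns *ns)
{
    return ns->attached - ns->killed;
}
```

With check_flag set, an attach that comes after toy_reaper_exit() fails with
-ENOMEM, so no task outlives the kill pass. With it clear, the late task is
attached and survives, which is the race described at the start of the thread.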