Date:   Fri, 12 May 2017 17:30:15 +0800
From:   Vovo Yang <vovoy@...gle.com>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     Guenter Roeck <linux@...ck-us.net>, Ingo Molnar <mingo@...nel.org>,
        linux-kernel@...r.kernel.org
Subject: Re: Threads stuck in zap_pid_ns_processes()

On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
<ebiederm@...ssion.com> wrote:
> Guenter Roeck <linux@...ck-us.net> writes:
>
>> On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
>>> Guenter Roeck <linux@...ck-us.net> writes:
>>>
>>> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>>> >> Guenter Roeck <linux@...ck-us.net> writes:
>>> >>
>>> >> > Hi all,
>>> >> >
>>> >> > the test program attached below almost always results in one of the child
>>> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>>> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>>> >> > a single thread left in the pid namespace (the one that is stuck).
>>> >> > Traceback from /proc/<pid>/stack is
>>> >> >
>>> >> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
>>> >> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
>>> >> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
>>> >> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
>>> >> > [<ffffffff81046e73>] do_signal+0x83/0xb90
>>> >> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
>>> >> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
>>> >> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
>>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>>> >> >
>>> >> > After 120 seconds, I get the "hung task" message.
>>> >> >
>>> >> > Example from v4.11:
>>> >> >
>>> >> > ...
>>> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>>> >> > [ 3263.379561]       Not tainted 4.11.0+ #1
>>> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> >> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
>>> >> > [ 3263.379587] Call Trace:
>>> >> > [ 3263.379608]  __schedule+0x677/0xda0
>>> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>>> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>>> >> > [ 3263.379643]  schedule+0x4d/0xd0
>>> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>>> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>>> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
>>> >> > ...
>>> >> >
>>> >> > The problem is seen in all kernels up to v4.11.
>>> >> >
>>> >> > Any idea what might be going on and how to fix the problem?
>>> >>
>>> >> Let me see.  Reading the code it looks like we have three tasks;
>>> >> let's call them main, child1, and child2.
>>> >>
>>> >> child1 and child2 are started using CLONE_THREAD and are
>>> >> thus clones of one another.
>>> >>
>>> >> child2 exits first but is ptraced by main so is not reaped.
>>> >>        Further, child2 calls do_group_exit, forcing child1 to
>>> >>        exit and making for fun races.
>>> >>
>>> >>        A pthread_exit() or syscall(SYS_exit, 0) would skip
>>> >>        the group exit and make the window larger.
>>> >>
>>> >> child1 exits next and calls zap_pid_ns_processes and is
>>> >>        waiting for child2 to be reaped by main.
>>> >>
>>> >> main is just sitting around doing nothing for 3600 seconds
>>> >> not reaping anyone.
>>> >>
>>> >> I would expect that when main exits everything would be cleaned up
>>> >> and the only real issue is that we have a hung task warning.
>>> >>
>>> >> Does everything cleanup when main exits?
>>> >>
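For reference, a minimal reproducer along the lines Eric describes
might look like the sketch below. This is a reconstruction for
illustration, not necessarily the exact test program Guenter attached;
it needs privilege (CAP_SYS_ADMIN) for CLONE_NEWPID, and error
handling is omitted.

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <unistd.h>

static char init_stack[64 * 1024];
static char thread_stack[64 * 1024];

/* child2: become a tracee of main, then take the whole thread group
 * down.  Because child2 is traced, it stays around as an unreaped
 * zombie until main waits for it. */
static int thread_fn(void *arg)
{
	ptrace(PTRACE_TRACEME, 0, NULL, NULL);
	exit(0);		/* glibc exit() -> exit_group() */
}

/* child1: pid 1 of the new pid namespace. */
static int init_fn(void *arg)
{
	clone(thread_fn, thread_stack + sizeof(thread_stack),
	      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	pause();		/* killed by child2's group exit, then
				 * waits in zap_pid_ns_processes() for
				 * child2 to be reaped */
	return 0;
}

int main(void)
{
	clone(init_fn, init_stack + sizeof(init_stack),
	      CLONE_NEWPID | SIGCHLD, NULL);
	sleep(3600);		/* main: never reap anyone */
	return 0;
}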
>>> >
>>> > As an add-on to my previous mail: I added a function to count
>>> > the number of threads in the pid namespace, using next_pidmap().
>>> > Even though nr_hashed == 2, only the hanging thread is still
>>> > present.
>>>
>>> For your testcase?  I suspect you copied the code from
>>> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
>>> calling zap_pid_ns_processes.
>>>
>>
>> Almost. Something along the lines of
>>
>>       count = 0;
>>       nr = next_pidmap(pid_ns, 0);
>>       while (nr > 0) {
>>               count++;
>>               nr = next_pidmap(pid_ns, nr);
>>       }
>>
>> only I also call sched_show_task() for each thread, and the only
>> one printed is the one that hangs in zap_pid_ns_processes().
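Filled out, that debug helper might look something like the sketch
below (written against the v4.11 pid code; dump_pid_ns_tasks() is a
made-up name for illustration):

/* Debug-only sketch: walk a pid namespace with next_pidmap() and
 * show every task whose pid is still hashed in it. */
static void dump_pid_ns_tasks(struct pid_namespace *pid_ns)
{
	struct task_struct *task;
	int nr, count = 0;

	rcu_read_lock();
	for (nr = next_pidmap(pid_ns, 0); nr > 0;
	     nr = next_pidmap(pid_ns, nr)) {
		task = pid_task(find_pid_ns(nr, pid_ns), PIDTYPE_PID);
		if (task)
			sched_show_task(task);
		count++;
	}
	rcu_read_unlock();
	pr_info("%d pid(s) left in the namespace\n", count);
}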
>>
>>> > Is there maybe a better way to terminate the wait loop than
>>> > with "if (pid_ns->nr_hashed == init_pids)" ?
>>>
>>> Not right now.  nr_hashed counts the number of "struct pid"s that are
>>> attached to processes and in the hash table for looking up tasks.
>>>
>>> Waiting until that drops to just the last task in the pid namespace
>>> really does do a good job of counting what we are looking for.  It
>>> is a little tricky because if our task_pid(p) != task_tgid(p) then
>>> nr_hashed will be 2, for various reasons.
>>>
>>> That said, it isn't as obvious as it should be, and I would like
>>> to improve that.
>>>
>>> I think the immediate solution here is to just use TASK_INTERRUPTIBLE
>>> instead of TASK_UNINTERRUPTIBLE.  This really is not a short-term
>>> disk sleep nor anything guaranteed to complete in any sort of
>>> finite time.  Plus the hung task timeout warning only triggers
>>> if your task state is TASK_UNINTERRUPTIBLE, so this will prevent
>>> the warning and keep a long delay in zap_pid_ns_processes from
>>> bumping up your load average.
>>>
>>> That still leaves the question of what strange state you are
>>> getting into, but this should at least prevent the unexpected
>>> reboots.
>>>
>>
>> What I know so far is
>> - We see this condition on a regular basis in the field. Regular is
>>   relative, of course - let's say maybe 1 in a million Chromebooks
>>   per day reports a crash because of it. That is not that many,
>>   but it adds up.
>> - We are able to reproduce the problem with a performance benchmark
>>   which opens 100 Chrome tabs. While that is a lot, it should not
>>   result in a kernel hang/crash.
>> - Vovo provided the test code last night. I don't know if this is
>>   exactly what is observed in the benchmark, or how it relates to the
>>   benchmark in the first place, but it is the first time we are actually
>>   able to reliably create a condition where the problem is seen.
>
> Thank you.  It will be interesting to hear what is happening in the
> Chrome performance benchmark that triggers this.
>

What's happening in the benchmark (see the sketch after the list):
1. A Chrome renderer process is created with CLONE_NEWPID.
2. The process crashes.
3. The Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) on every
   thread of the crashed process to dump info.
4. When breakpad detaches from the crashed process, the crashed process
   gets stuck in zap_pid_ns_processes().
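
Roughly, the attach/dump/detach sequence in steps 3 and 4 might look
like this sketch (an illustration, not breakpad's actual code;
attach_detach_all() is a made-up name and error handling is omitted):

#define _GNU_SOURCE
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Attach to every thread of |pid| via /proc/<pid>/task, pretend to
 * dump crash info, then detach again. */
static void attach_detach_all(pid_t pid)
{
	struct dirent *de;
	char path[64];
	DIR *dir;

	snprintf(path, sizeof(path), "/proc/%d/task", pid);
	dir = opendir(path);
	while ((de = readdir(dir)) != NULL) {
		pid_t tid = atoi(de->d_name);	/* "." and ".." -> 0 */
		if (tid <= 0)
			continue;
		ptrace(PTRACE_ATTACH, tid, NULL, NULL);
		waitpid(tid, NULL, __WALL);	/* wait for the stop */
		/* ... read registers/memory for the minidump ... */
		ptrace(PTRACE_DETACH, tid, NULL, NULL);
	}
	closedir(dir);
}

After the detach, the crashed process resumes its exit and, as pid 1
of its namespace, ends up waiting in zap_pid_ns_processes().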

Thanks,
Vovo

> There may be some other steps that are worth taking.
>
>>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>>> index 77403c157157..971e7bc6939b 100644
>>> --- a/kernel/pid_namespace.c
>>> +++ b/kernel/pid_namespace.c
>>> @@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>>          * if reparented.
>>>          */
>>>         for (;;) {
>>> -               set_current_state(TASK_UNINTERRUPTIBLE);
>>> +               set_current_state(TASK_INTERRUPTIBLE);
>>>                 if (pid_ns->nr_hashed == init_pids)
>>>                         break;
>>>                 schedule();
>>>
>>
>> I'll give it a try. Are there any potential unintended side effects?
>
> No.  The only reason I didn't implement it this way in the first place
> is that I have cognitive dissonance between the name of that state and
> the fact that the code isn't interruptible.
>
> In the long term it is probably better to refactor that code path so it
> doesn't exist in the first place, and use more common kernel idioms.
> But for right now it should not be a problem.
>
>
> Eric
>
