linux-kernel - Re: Threads stuck in zap_pid_ns

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8760h79e22.fsf@xmission.com>
Date:   Thu, 11 May 2017 18:19:49 -0500
From:   ebiederm@...ssion.com (Eric W. Biederman)
To:     Guenter Roeck <linux@...ck-us.net>
Cc:     Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
        Vovo Yang <vovoy@...gle.com>
Subject: Re: Threads stuck in zap_pid_ns_processes()

Guenter Roeck <linux@...ck-us.net> writes:

> On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@...ck-us.net> writes:
>> 
>> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> >> Guenter Roeck <linux@...ck-us.net> writes:
>> >> 
>> >> > Hi all,
>> >> >
>> >> > the test program attached below almost always results in one of the child
>> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>> >> > a single thread left in the pid namespace (the one that is stuck).
>> >> > Traceback from /proc/<pid>/stack is
>> >> >
>> >> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
>> >> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
>> >> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
>> >> > [<ffffffff81046e73>] do_signal+0x83/0xb90
>> >> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
>> >> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
>> >> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>> >> >
>> >> > After 120 seconds, I get the "hung task" message.
>> >> >
>> >> > Example from v4.11:
>> >> >
>> >> > ...
>> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> >> > [ 3263.379561]       Not tainted 4.11.0+ #1
>> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
>> >> > [ 3263.379587] Call Trace:
>> >> > [ 3263.379608]  __schedule+0x677/0xda0
>> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> >> > [ 3263.379643]  schedule+0x4d/0xd0
>> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> >> > ...
>> >> >
>> >> > The problem is seen in all kernels up to v4.11.
>> >> >
>> >> > Any idea what might be going on and how to fix the problem ?
>> >> 
>> >> Let me see.  Reading the code it looks like we have three tasks
>> >> let's call them main, child1, and child2.
>> >> 
>> >> child1 and child2 are started using CLONE_THREAD and are
>> >> thus clones of one another.
>> >> 
>> >> child2 exits first but is ptraced by main so is not reaped.
>> >>        Further child2 calls do_group_exit forcing child1 to
>> >>        exit making for fun races.
>> >> 
>> >>        A ptread_exit() or syscall(SYS_exit, 0); would skip
>> >>        the group exit and make the window larger.
>> >> 
>> >> child1 exits next and calls zap_pid_ns_processes and is
>> >>        waiting for child2 to be reaped by main.
>> >> 
>> >> main is just sitting around doing nothing for 3600 seconds
>> >> not reaping anyone.
>> >> 
>> >> I would expect that when main exits everything would be cleaned up
>> >> and the only real issue is that we have a hung task warning.
>> >> 
>> >> Does everything cleanup when main exits?
>> >> 
>> >
>> > As an add-on to my previous mail: I added a function to count
>> > the number of threads in the pid namespace, using next_pidmap().
>> > Even though nr_hashed == 2, only the hanging thread is still
>> > present.
>> 
>> For your testcase?  I suspect you copied the code from
>> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
>> calling zap_pid_ns_processes.
>> 
>
> Almost. Something along the line of
>
> 	count = 0;
> 	nr = next_pidmap(pid_ns, 0);
> 	while (nr > 0) {
> 		count++;
> 		nr = next_pidmap(pid_ns, nr);
> 	}
>
> only I also call sched_show_task() for each thread, and the only
> one printed is the one that hangs in zap_pid_ns_processes().
>
>> > Is there maybe a better way to terminate the wait loop than
>> > with "if (pid_ns->nr_hashed == init_pids)" ?
>> 
>> Not right now.  nr_hashed counts the number of "struct pid"s that are
>> in attached to processes and in the hash table for looking up tasks.
>> 
>> Waiting until that drops to just the last task in the pid namespace
>> really does do a good job of counting what we are looking for.  It
>> is a little tricky because if our task_pid(p) != task_tgid(p) then
>> nr_hashed will be 2.  For various reasons.
>> 
>> That said it isn't as obvious as it should be and I would like
>> to improve that.
>> 
>> I think the immediate solution here is to just use TASK_INTERRUPTIBLE
>> instead of TASK_UNITERRUPTIBLE.  This really is not a short term
>> disk sleep nor anything guaranteed to complete in any sort of
>> finite time.  Plus the hung task timeout warning only triggers
>> if your task state is TASK_UNINTERRUPTIBLE so this will prevent
>> the warning and keep a long delay in zap_pid_ns_processes from
>> bumping up your load average.
>> 
>> That still leaves the question of what state strange state
>> you are getting into but this should at least prevent the unexepcted
>> reboots.
>> 
>
> What I know so far is 
> - We see this condition on a regular basis in the field. Regular is
>   relative, of course - let's say maybe 1 in a Milion Chromebooks
>   per day reports a crash because of it. That is not that many,
>   but it adds up.
> - We are able to reproduce the problem with a performance benchmark
>   which opens 100 chrome tabs. While that is a lot, it should not
>   result in a kernel hang/crash.
> - Vovo proviced the test code last night. I don't know if this is
>   exactly what is observed in the benchmark, or how it relates to the
>   benchmark in the first place, but it is the first time we are actually
>   able to reliably create a condition where the problem is seen.

Thank you.  I will be interesting to hear what is happening in the
chrome perfomance benchmark that triggers this.

There may be some other steps that are worth taking.

>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>> index 77403c157157..971e7bc6939b 100644
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>          * if reparented.
>>          */
>>         for (;;) {
>> -               set_current_state(TASK_UNINTERRUPTIBLE);
>> +               set_current_state(TASK_INTERRUPTIBLE);
>>                 if (pid_ns->nr_hashed == init_pids)
>>                         break;
>>                 schedule();
>> 
>
> I'll give it a try. Are there any potential unintended side effects ?

No.  The only reason I didn't implement this way in the first place is
that I have cognitive disssonance between the name of that state and
the fact the code isn't interruptible.

In the long term it is probably better to refactor that code path so it
doesn't exist in the first place, and use more common kernel idioms.
But for right now it should not be a problem.


Eric