linux-kernel - Re: Threads stuck in zap_pid_ns

Message-ID: <20170511204848.GA11306@roeck-us.net>
Date:   Thu, 11 May 2017 13:48:48 -0700
From:   Guenter Roeck <linux@...ck-us.net>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
        Vovo Yang <vovoy@...gle.com>
Subject: Re: Threads stuck in zap_pid_ns_processes()

On Thu, May 11, 2017 at 03:23:17PM -0500, Eric W. Biederman wrote:
> Guenter Roeck <linux@...ck-us.net> writes:
> 
> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
> >> Guenter Roeck <linux@...ck-us.net> writes:
> >> 
> >> > Hi all,
> >> >
> >> > the test program attached below almost always results in one of the child
> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
> >> > see from test logs that nr_hashed == 2 and init_pids == 1, but there is only
> >> > a single thread left in the pid namespace (the one that is stuck).
> >> > Traceback from /proc/<pid>/stack is
> >> >
> >> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
> >> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
> >> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
> >> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
> >> > [<ffffffff81046e73>] do_signal+0x83/0xb90
> >> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
> >> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
> >> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >> >
> >> > After 120 seconds, I get the "hung task" message.
> >> >
> >> > Example from v4.11:
> >> >
> >> > ...
> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> >> > [ 3263.379561]       Not tainted 4.11.0+ #1
> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
> >> > [ 3263.379587] Call Trace:
> >> > [ 3263.379608]  __schedule+0x677/0xda0
> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
> >> > [ 3263.379643]  schedule+0x4d/0xd0
> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
> >> > ...
> >> >
> >> > The problem is seen in all kernels up to v4.11.
> >> >
> >> > Any idea what might be going on and how to fix the problem?
> >> 
> >> Let me see.  Reading the code it looks like we have three tasks
> >> let's call them main, child1, and child2.
> >> 
> >> child2 is started from child1 using CLONE_THREAD, so the two
> >> are threads in the same thread group.
> >> 
> >> child2 exits first but is ptraced by main so is not reaped.
> >>        Further child2 calls do_group_exit forcing child1 to
> >>        exit making for fun races.
> >> 
> >>        A pthread_exit() or syscall(SYS_exit, 0); would skip
> >>        the group exit and make the window larger.
> >> 
> >> child1 exits next and calls zap_pid_ns_processes and is
> >>        waiting for child2 to be reaped by main.
> >> 
> >> main is just sitting around doing nothing for 3600 seconds
> >> not reaping anyone.
> >> 
> >> I would expect that when main exits everything would be cleaned up
> >> and the only real issue is that we have a hung task warning.
> >> 
> >> Does everything clean up when main exits?
> >> 
> > Yes, it does.
> >
> > The problem is that if the main task doesn't exit, the child hangs forever.
> > Chrome OS (where we see the problem in the field, and the application
> > is chrome) is configured to reboot on hung tasks - if a task is hung
> > for 120 seconds on those systems, the system tends to be in bad shape.
> > This makes it quite a severe problem for us.
> 
> We definitely need to change the code in such a way as to not
> trigger the hung task warning.
> 
> > I browsed through a number of stack traces - the ones I looked at
> > all have the same traceback from do_group_exit(), though the
> > caller of do_group_exit() is not always the same (that may depend
> > on the kernel version, though). Sometimes it is __wake_up_parent(),
> > sometimes it is get_signal_to_deliver(), sometimes it is
> > get_signal().
> >
> > Are there other conditions besides ptrace where a task isn't reaped?
> 
> The problem here is that the parent was outside of the pid namespace
> and was choosing not to reap the child.  The main task could easily
> have reaped the child when it received SIGCHLD.
> 
> Arranging for forked children to run in a pid namespace with setns
> can also create processes with parents outside the pid namespace,
> that may or may not be reaped.
> 
> I suspect it is a bug to allow PTRACE_TRACEME if either the parent
> or the child has called exec.  That is not enough to block your test
> case but I think it would block real world problems.
> 
> For getting to the root cause of this I would see if I could get
> a full process list.  It would be interesting to know what
> the unreaped zombie is.
> 
root     11579 11472  0 13:38 pts/15   00:00:00 sudo ./clone
root     11580 11579  0 13:38 pts/15   00:00:00 ./clone
root     11581 11580  0 13:38 pts/15   00:00:00 [clone]

I assume this means that it is child1? What other information
would be needed?

Thanks,
Guenter

> In the meantime I am going to see what it takes to change the code
> so it doesn't trigger the hung task warning.  I don't want to change
> the actual observed behavior of the code but I can avoid triggering
> a warning that does not apply.
> 
> Eric
> 
> 
> >
> > Thanks,
> > Guenter
> >
> >> Eric
> >> 
> >> 
> >> >
> >> > Thanks,
> >> > Guenter
> >> >
> >> > ---
> >> > This test program was kindly provided by Vovo Yang <vovoy@...gle.com>.
> >> >
> >> > Note that the ptrace() call in child1() is not necessary for the problem
> >> > to be seen, though it seems to make it a bit more likely.
> >> 
> >> That would appear to just slow things down a smidge, as
> >> nothing substantial happens ptrace-wise until after
> >> zap_pid_ns_processes.
> >> 
> >> 
> >> > ---
> >> >
> >> > #define _GNU_SOURCE
> >> > #include <stdio.h>
> >> > #include <stdlib.h>
> >> > #include <unistd.h>
> >> > #include <sys/ptrace.h>
> >> > #include <errno.h>
> >> > #include <string.h>
> >> > #include <sched.h>
> >> >
> >> > #define STACK_SIZE 65536
> >> >
> >> > int child1(void* arg);
> >> > int child2(void* arg);
> >> >
> >> > int main(int argc, char **argv)
> >> > {
> >> >   int child_pid;
> >> >   char* child_stack = malloc(STACK_SIZE);
> >> >   char* stack_top = child_stack + STACK_SIZE;
> >> >   char command[256];
> >> >
> >> >   child_pid = clone(&child1, stack_top, CLONE_NEWPID, NULL);
> >> >   if (child_pid == -1) {
> >> >     printf("parent: clone failed: %s\n", strerror(errno));
> >> >     return EXIT_FAILURE;
> >> >   }
> >> >   printf("parent: child1_pid: %d\n", child_pid);
> >> >
> >> >   sleep(2);
> >> >   printf("child state, if it's D (disk sleep), the child process is hung\n");
> >> >   sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
> >> >   system(command);
> >> >   sleep(3600);
> >> >   return EXIT_SUCCESS;
> >> > }
> >> >
> >> > int child1(void* arg)
> >> > {
> >> >   int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
> >> >   char* child_stack = malloc(STACK_SIZE);
> >> >   char* stack_top = child_stack + STACK_SIZE;
> >> >   long ret;
> >> >
> >> >   ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
> >> >   if (ret == -1) {
> >> >     printf("child1: ptrace failed: %s\n", strerror(errno));
> >> >     return EXIT_FAILURE;
> >> >   }
> >> >
> >> >   ret = clone(&child2, stack_top, flags, NULL);
> >> >   if (ret == -1) {
> >> >     printf("child1: clone failed: %s\n", strerror(errno));
> >> >     return EXIT_FAILURE;
> >> >   }
> >> >   printf("child1: child2 pid: %ld\n", ret);
> >> >
> >> >   sleep(1);
> >> >   printf("child1: end\n");
> >> >   return EXIT_SUCCESS;
> >> > }
> >> >
> >> > int child2(void* arg)
> >> > {
> >> >   long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
> >> >   if (ret == -1) {
> >> >     printf("child2: ptrace failed: %s\n", strerror(errno));
> >> >     return EXIT_FAILURE;
> >> >   }
> >> >
> >> >   printf("child2: end\n");
> >> >   return EXIT_SUCCESS;
> >> > }