linux-kernel - Re: Threads stuck in zap_pid_ns

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170512194304.GE12960@roeck-us.net>
Date:   Fri, 12 May 2017 12:43:04 -0700
From:   Guenter Roeck <linux@...ck-us.net>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     Vovo Yang <vovoy@...gle.com>, Ingo Molnar <mingo@...nel.org>,
        linux-kernel@...r.kernel.org
Subject: Re: Threads stuck in zap_pid_ns_processes()

Hi Eric,

On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote:
> Guenter Roeck <linux@...ck-us.net> writes:
> 
> > Hi Eric,
> >
> > On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
> >> Vovo Yang <vovoy@...gle.com> writes:
> >> 
> >> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
> >> > <ebiederm@...ssion.com> wrote:
> >> >> Guenter Roeck <linux@...ck-us.net> writes:
> >> >>
> >> >>> What I know so far is
> >> >>> - We see this condition on a regular basis in the field. Regular is
> >> >>>   relative, of course - let's say maybe 1 in a Milion Chromebooks
> >> >>>   per day reports a crash because of it. That is not that many,
> >> >>>   but it adds up.
> >> >>> - We are able to reproduce the problem with a performance benchmark
> >> >>>   which opens 100 chrome tabs. While that is a lot, it should not
> >> >>>   result in a kernel hang/crash.
> >> >>> - Vovo proviced the test code last night. I don't know if this is
> >> >>>   exactly what is observed in the benchmark, or how it relates to the
> >> >>>   benchmark in the first place, but it is the first time we are actually
> >> >>>   able to reliably create a condition where the problem is seen.
> >> >>
> >> >> Thank you.  I will be interesting to hear what is happening in the
> >> >> chrome perfomance benchmark that triggers this.
> >> >>
> >> > What's happening in the benchmark:
> >> > 1. A chrome renderer process was created with CLONE_NEWPID
> >> > 2. The process crashed
> >> > 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
> >> >   threads of the crashed process to dump info
> >> > 4. When breakpad detach the crashed process, the crashed process stuck in
> >> >   zap_pid_ns_processes()
> >> 
> >> Very interesting thank you.
> >> 
> >> So the question is specifically which interaction is causing this.
> >> 
> >> In the test case provided it was a sibling task in the pid namespace
> >> dying and not being reaped.  Which may be what is happening with
> >> breakpad.  So far I have yet to see kernel bug but I won't rule one out.
> >> 
> >
> > I am trying to understand what you are looking for. I would have thought
> > that both the test application as well as the Chrome functionality
> > described above show that there are situations where zap_pid_ns_processes()
> > can get stuck and cause hung task timeouts in conjunction with the use of
> > ptrace().
> >
> > Your last sentence seems to suggest that you believe that the kernel might
> > do what it is expected to do. Assuming this is the case, what else would
> > you like to see ? A test application which matches exactly the Chrome use
> > case ? We can try to provide that, but I don't entirely understand how
> > that would change the situation. After all, we already know that it is
> > possible to get a thread into this condition, and we already have one
> > means to reproduce it.
> >
> > Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTABLE works for both the
> > test application and the Chrome benchmark. The thread is still stuck in
> > zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
> > and no longer results in a hung task timeout. It remains in that state
> > until the parent process terminates. I am not entirely happy with it
> > since the processes are still stuck and may pile up over time, but at
> > least it solves the immediate problem for us.
> >
> > Question now is what to do with that solution. We can of course apply
> > it locally to Chrome OS, but I would rather have it upstream - especially
> > since we have to assume that any users of Chrome on Linux, or more
> > generically anyone using ptrace in conjunction with CLONE_NEWPID, may
> > experience the same problem. Right now I have no idea how to get there,
> > though. Can you provide some guidance ?
> 
> Apologies for not being clear.  I intend to send a pull request with the

No worries.

> the TASK_UINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
> next week or so with a Cc stable and an appropriate Fixes tag.  So the
> fix can be backported.
> 
Great, that is good to know.

> I have a more comprehensive change queued I will probably merge for 4.13
> already but it just changes what kind of zombies you see.  It won't
> remove the ``stuck'' zombies.
> 
> So what I am looking for now is:
> Why are things getting stuck in your benchmark?
> 
> -  Is it a userspace bug?
> 
>   In which case we can figure out what userspace (aka breakpad) needs
>    to do to avoid the problem.
>    
I spent some time trying to understand what breakpad is doing. I don't
really see how it can be blamed. It attaches itself to a crashing thread,
gets the information it is looking for, and detaches itself. At the
same time, whatever it does, it seems to me that userspace should not
be able to create such a situation in the first place.

> -  Is it a kernel bug with ptrace?
> 
>    There have been a lot of little subtle bugs with ptrace over the
>    years so one more would not surprise
> 
Good question. I suspect that PTRACE_ATTACH (or PTRACE_TRACEME)
changes the reaper to be the calling process, and a subsequent
PTRACE_DETACH or exit of the process calling PTRACE_TRACEME doesn't
change it back or changes it to the process group owner. Is that
what happens, and if so is it what should happen ? I don't know (yet).

> So I am just looking to make certain we fix the root issue not just
> the hung task timeout warning.
> 
Makes sense. At least with TASK_INTERRUPTIBLE, I suspect we are just
fixing the symptom, but on the other side that isn't really my area
of expertise, so I don't really know.

One additional data point: If I time the threads in the test code by
adding various calls to sleep(), it is still more or less random if the
task status ends up in Z or D, meaning the child thread is sometimes
reaped and sometimes not. I would have expected it to be well defined.
I may be wrong, but that smells racy to me.

Anyway, I'll spend some more time on this. I'll let you know if I find
anything.

Thanks,
Guenter