Date:   Fri, 12 May 2017 09:52:14 -0700
From:   Guenter Roeck <>
To:     "Eric W. Biederman" <>
Cc:     Vovo Yang <>, Ingo Molnar <>,
Subject: Re: Threads stuck in zap_pid_ns_processes()

Hi Eric,

On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
> Vovo Yang <> writes:
> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
> > <> wrote:
> >> Guenter Roeck <> writes:
> >>
> >>> What I know so far is
> >>> - We see this condition on a regular basis in the field. Regular is
> > >>>   relative, of course - let's say maybe 1 in a million Chromebooks
> >>>   per day reports a crash because of it. That is not that many,
> >>>   but it adds up.
> >>> - We are able to reproduce the problem with a performance benchmark
> >>>   which opens 100 chrome tabs. While that is a lot, it should not
> >>>   result in a kernel hang/crash.
> > >>> - Vovo provided the test code last night. I don't know if this is
> >>>   exactly what is observed in the benchmark, or how it relates to the
> >>>   benchmark in the first place, but it is the first time we are actually
> >>>   able to reliably create a condition where the problem is seen.
> >>
> > >> Thank you.  It will be interesting to hear what is happening in the
> > >> chrome performance benchmark that triggers this.
> >>
> > What's happening in the benchmark:
> > 1. A chrome renderer process was created with CLONE_NEWPID
> > 2. The process crashed
> > 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
> >   thread of the crashed process to dump info
> > 4. When breakpad detaches from the crashed process, the crashed process gets
> >   stuck in zap_pid_ns_processes()
> Very interesting, thank you.
> So the question is specifically which interaction is causing this.
> In the test case provided it was a sibling task in the pid namespace
> dying and not being reaped.  Which may be what is happening with
> breakpad.  So far I have yet to see a kernel bug but I won't rule one out.

I am trying to understand what you are looking for. I would have thought
that both the test application and the Chrome functionality
described above show that there are situations where zap_pid_ns_processes()
can get stuck and cause hung task timeouts in conjunction with the use of

Your last sentence seems to suggest that you believe that the kernel might
do what it is expected to do. Assuming this is the case, what else would
you like to see ? A test application which matches exactly the Chrome use
case ? We can try to provide that, but I don't entirely understand how
that would change the situation. After all, we already know that it is
possible to get a thread into this condition, and we already have one
means to reproduce it.

test application and the Chrome benchmark. The thread is still stuck in
zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
and no longer results in a hung task timeout. It remains in that state
until the parent process terminates. I am not entirely happy with it
since the processes are still stuck and may pile up over time, but at
least it solves the immediate problem for us.
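
For anyone who wants to check whether threads are sitting in this state on
a running system, the following is a minimal sketch and not part of the
patch or the test code discussed above. It assumes a Linux /proc layout
where /proc/<pid>/task/<tid>/wchan reports the kernel symbol a thread is
blocked in (this needs symbol information and, on many kernels, sufficient
privileges; otherwise the file may just read "0"). The helper names
parse_stat_state, read_text and tasks_waiting_in are made up for this
illustration.

```python
import os


def parse_stat_state(stat_line):
    """Return the one-letter task state from the text of a /proc stat file."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or ')', so split after the LAST ')' instead of on plain whitespace.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def read_text(path):
    """Read a small /proc file; return None if the task vanished or access fails."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None


def tasks_waiting_in(symbol="zap_pid_ns_processes", proc_root="/proc"):
    """List (pid, tid, state) for every thread whose wchan equals `symbol`."""
    found = []
    try:
        pids = [d for d in os.listdir(proc_root) if d.isdigit()]
    except OSError:
        return found
    for pid in pids:
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            base = os.path.join(task_dir, tid)
            wchan = read_text(os.path.join(base, "wchan"))
            stat = read_text(os.path.join(base, "stat"))
            if not wchan or not stat:
                continue
            if wchan.strip() == symbol:
                found.append((int(pid), int(tid), parse_stat_state(stat)))
    return found
```

Calling tasks_waiting_in() and printing the resulting tuples shows, per
thread, whether it is reported as D (uninterruptible, subject to the hung
task detector) or S (interruptible sleep). An empty list can also mean that
wchan is not readable on that system, so it is not proof that no thread is
stuck.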

Question now is what to do with that solution. We can of course apply
it locally to Chrome OS, but I would rather have it upstream - especially
since we have to assume that any users of Chrome on Linux, or more
generically anyone using ptrace in conjunction with CLONE_NEWPID, may
experience the same problem. Right now I have no idea how to get there,
though. Can you provide some guidance ?

