--- DOCb	2011-05-30 17:11:48.000619211 +0200
+++ DOCc	2011-06-03 16:54:21.627633373 +0200
@@ -72,32 +72,27 @@
 group reports its death to its tracer.
 
 If PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen
-before actual death. This applies to both normal exits and signal
-deaths (except SIGKILL).
-
-KNOWN BUG: PTRACE_EVENT_EXIT should happen for every tracee in thread
-group on exit_group or signal death, but currently (~2.6.38) this is
-buggy: some of these stops may be missed.
+before actual death. This applies to exits via the exit syscall, the
+exit_group syscall and signal deaths (except SIGKILL), and to threads
+torn down by execve in a multi-threaded process.
 
 Tracer cannot assume that ptrace-stopped tracee exists. There are many
-scenarios when tracee may die while stopped (such as SIGKILL). There
-are cases where tracee disappears without reporting death (such as
-execve in multi-threaded process). Therefore, tracer must always be
-prepared to handle ESRCH error on any ptrace operation. Unfortunately,
-the same error is returned if tracee exists but is not ptrace-stopped
-(for commands which require stopped tracee). Tracer needs to keep track
-of stopped/running state, and interpret ESRCH as "tracee died
-unexpectedly" only if it knows that tracee has been observed to enter
-ptrace-stop.
-
-There is no guarantee that waitpid(WNOHANG) will reliably report
-tracee's death status if ptrace operation returned ESRCH.
-waitpid(WNOHANG) may return 0 instead. IOW: tracee may be "not yet
-fully dead" but already refusing ptrace ops.
+scenarios in which a tracee may die while stopped (e.g. on SIGKILL).
+Therefore, the tracer must always be prepared to handle ESRCH on any
+ptrace operation. Unfortunately, the same error is returned if the
+tracee exists but is not ptrace-stopped (for commands which require a
+stopped tracee), or if it is not traced by the calling process. The
+tracer needs to keep track of stopped/running state and interpret
+ESRCH as "tracee died unexpectedly" only if it knows that the tracee
+has been observed to enter ptrace-stop. Note that there is no guarantee
+that waitpid(WNOHANG) will reliably report the tracee's death status if
+a ptrace operation returned ESRCH: waitpid(WNOHANG) may return 0 instead.
+IOW: tracee may be "not yet fully dead" but already refusing ptrace ops.
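The ESRCH rule above can be sketched as a small helper. It assumes the tracer keeps a per-tracee flag that it sets when waitpid() reports a ptrace-stop and clears when it restarts the tracee. The enum and function names are hypothetical and not part of any ptrace API.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical verdicts for a failed ptrace() call. */
enum ptrace_err_verdict {
    TR_ERR_OTHER,         /* not ESRCH: handle as an ordinary error */
    TR_ESRCH_DIED,        /* tracee was known to be ptrace-stopped: it died */
    TR_ESRCH_NOT_STOPPED  /* ESRCH, but tracee may be running or not ours */
};

/* err: errno after ptrace() returned -1.
 * observed_stopped: true only if waitpid() reported a ptrace-stop for
 * this tracee and the tracer has not restarted it since. */
enum ptrace_err_verdict interpret_ptrace_errno(int err, bool observed_stopped)
{
    if (err != ESRCH)
        return TR_ERR_OTHER;
    /* ESRCH is ambiguous unless we know the tracee was stopped. */
    return observed_stopped ? TR_ESRCH_DIED : TR_ESRCH_NOT_STOPPED;
}
```

TR_ESRCH_DIED only means the tracee no longer accepts ptrace ops. As noted above, an immediate waitpid(WNOHANG) may still return 0, so a 0 from that call is not evidence that the tracee is alive.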
 
 Tracer can not assume that tracee ALWAYS ends its life by reporting
-WIFEXITED(status) or WIFSIGNALED(status). One notable case is execve in
-multi-threaded process, which is described later.
+WIFEXITED(status) or WIFSIGNALED(status).
+
+??? or can it? Do we include such a promise in the ptrace API?
 
 
 	1.x Stopped states.
@@ -112,14 +107,15 @@
 WIFSTOPPED(status) == true.
 
 ??? Do we require __WALL usage, or will just using 0 be ok? Are the
-rules different if user wants to use waitid? Will waitid require WEXITED?
+rules different if user wants to use waitid? Will waitid require
+WEXITED?
 
 __WALL value does not include WSTOPPED and WEXITED bits, but implies
 their functionality.
 
 Setting of WCONTINUED bit in waitpid flags is not recommended: the
-continued state is per-process and consuming it would confuse real
-parent of the tracee.
+continued state is per-process and consuming it can confuse the real
+parent of the tracee.
 
 Use of WNOHANG bit in waitpid flags may cause waitpid return 0 ("no
 wait results available yet") even if tracer knows there should be a
@@ -134,23 +130,23 @@
 group-stop, PTRACE_EVENT stops, syscall-stops [, SINGLESTEP, SYSEMU,
 SYSEMU_SINGLESTEP]. They all are reported as waitpid result with
 WIFSTOPPED(status) == true. They may be differentiated by checking
-(status >> 8) value (note that WSTOPSIG(status) is (status >> 8) &
-0xff) and if looking at (status >> 8) value doesn't resolve ambiguity,
-by querying PTRACE_GETSIGINFO.
+the (status >> 8) value and, if that doesn't resolve the ambiguity, by
+querying PTRACE_GETSIGINFO. (Note: the WSTOPSIG(status) macro returns
+((status >> 8) & 0xff).)
 
 
 	1.x.x Signal-delivery-stop
 
 When (possibly multi-threaded) process receives any signal except
 SIGKILL, kernel selects a thread which handles the signal (if signal is
-generated with tgkill, thread selection is done by user). If selected
+generated with t[g]kill, thread selection is done by user). If selected
 thread is traced, it enters signal-delivery-stop. By this point, signal
 is not yet delivered to the process, and can be suppressed by tracer.
 If tracer doesn't suppress the signal, it passes signal to tracee in
-the next ptrace request. This is called "signal injection" and will be
-described later. Note that if signal is blocked, signal-delivery-stop
-doesn't happen until signal is unblocked, with the usual exception that
-SIGSTOP can't be blocked.
+the next ptrace request. This second step of signal delivery is called
+"signal injection" in this document. Note that if signal is blocked,
+signal-delivery-stop doesn't happen until signal is unblocked, with the
+usual exception that SIGSTOP can't be blocked.
 
 Signal-delivery-stop is observed by tracer as waitpid returning with
 WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. If
@@ -164,11 +160,13 @@
 
 After signal-delivery-stop is observed by tracer, tracer should restart
 tracee with
+
 	ptrace(PTRACE_rest, pid, 0, sig)
+
 call, where PTRACE_rest is one of the restarting ptrace ops. If sig is
 0, then signal is not delivered. Otherwise, signal sig is delivered.
-This operation is called "signal injection", to distinguish it from
-signal delivery which causes signal-delivery-stop.
+This operation is called "signal injection" in this document, to
+distinguish it from signal-delivery-stop.
 
 Note that sig value may be different from WSTOPSIG(status) value -
 tracer can cause a different signal to be injected.
@@ -221,13 +219,15 @@
 tracee only), and only after it is injected by tracer (or after it was
 dispatched to a thread which isn't traced), group-stop will be
 initiated on ALL tracees within multi-threaded process. As usual, every
-tracee reports its group-stop to corresponding tracer.
+tracee reports its group-stop separately to corresponding tracer.
 
 Group-stop is observed by tracer as waitpid returning with
 WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. The same result
 is returned by some other classes of ptrace-stops, therefore the
 recommended practice is to perform
+
 	ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
+
 call. The call can be avoided if signal number is not SIGSTOP, SIGTSTP,
 SIGTTIN or SIGTTOU - only these four signals are stopping signals. If
 tracer sees something else, it can't be group-stop. Otherwise, tracer
@@ -277,9 +277,11 @@
 
 PTRACE_EVENT_EXEC - stop before return from exec.
 
-PTRACE_EVENT_EXIT - stop before exit. PTRACE_GETEVENTMSG returns exit
-status. Registers can be examined (unlike when "real" exit happens).
-The tracee is still alive, it needs to be PTRACE_CONTed to finish exit.
+PTRACE_EVENT_EXIT - stop before exit (including death from exit_group),
+signal death, or exit caused by execve in a multi-threaded process.
+PTRACE_GETEVENTMSG returns the exit status. Registers can be examined
+(unlike when the "real" exit happens). The tracee is still alive; it
+needs to be PTRACE_CONTed or PTRACE_DETACHed to finish exiting.
 
 PTRACE_GETSIGINFO on PTRACE_EVENT stops returns si_signo = SIGTRAP,
 si_code = (event << 8) | SIGTRAP.
@@ -369,7 +371,9 @@
 
 Another group of commands makes ptrace-stopped tracee run. They have
 the form:
+
 	ptrace(PTRACE_cmd, pid, 0, sig);
+
 where cmd is CONT, DETACH, SYSCALL, SINGLESTEP, SYSEMU, or
 SYSEMU_SINGLESTEP. If tracee is in signal-delivery-stop, sig is the
 signal to be injected. Otherwise, sig may be ignored.
@@ -394,8 +398,9 @@
 
 ptrace(PTRACE_TRACEME, 0, 0, 0) request turns current thread into a
 tracee. It continues to run (doesn't enter ptrace-stop). A common
-practice is follow ptrace(PTRACE_TRACEME) with raise(SIGSTOP) and allow
-parent (which is our tracer now) to observe our signal-delivery-stop.
+practice is to follow ptrace(PTRACE_TRACEME) with raise(SIGSTOP) and
+allow parent (which is our tracer now) to observe our
+signal-delivery-stop.
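The PTRACE_TRACEME + raise(SIGSTOP) pattern can be sketched as follows. It is Linux only; the function name and the exit code 42 are arbitrary choices for illustration. The parent observes the child's signal-delivery-stop for SIGSTOP and restarts it with sig = 0. The SIGSTOP is therefore suppressed rather than injected, and the child runs on to _exit().

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the child's exit code (42), or -1 on any unexpected result. */
int traceme_then_sigstop(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: becomes a tracee but keeps running. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(100);
        raise(SIGSTOP);   /* parent sees signal-delivery-stop */
        _exit(42);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGSTOP) {
        if (WIFSTOPPED(status)) {   /* some other stop: clean up */
            kill(pid, SIGKILL);
            waitpid(pid, &status, 0);
        }
        return -1;
    }
    /* sig = 0: SIGSTOP is not delivered, child continues. */
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

If ptrace is forbidden in the environment (for example by a seccomp policy), PTRACE_TRACEME fails. The child then exits with code 100 before stopping, and the function returns -1.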
 
 If PTRACE_O_TRACE[V]FORK or PTRACE_O_TRACECLONE options are in effect,
 then children created by (vfork or clone(CLONE_VFORK)), (fork or
@@ -435,9 +440,11 @@
 resets execve'ing thread tid to tgid (process id). This looks very
 confusing to tracers:
 
-All other threads "disappear" - that is, they terminate their execution
-without returning any waitpid notifications to anyone, even if they are
-currently traced.
+All other threads stop in PTRACE_EVENT_EXIT stop, if requested by the
+PTRACE_O_TRACEEXIT option. Then all other threads except the thread
+group leader report death as if they exited via exit syscall with exit
+code 0. Then PTRACE_EVENT_EXEC stop happens, if requested by the
+PTRACE_O_TRACEEXEC option (on which tracee - leader? execve-ing one?).
 
 The execve-ing tracee changes its pid while it is in execve syscall.
 (Remember, under ptrace 'pid' returned from waitpid, or fed into ptrace
@@ -461,35 +468,32 @@
 Pid change happens before PTRACE_EVENT_EXEC stop, not after.
 
 When tracer receives PTRACE_EVENT_EXEC stop notification, it is
-guaranteed that except this tracee, no other threads from the process
-are alive. Moreover, it is guaranteed that tracer will not receive any
-"buffered" death reports from any of them, even if some threads were
-racing with execve'ing tracee, for example were entering exit syscall.
+guaranteed that, apart from this tracee and the thread group leader, no
+other threads from the process are alive.
 
 On receiving this notification, tracer should clean up all its internal
 data structures about all threads of this process, and retain only one
 data structure, one which describes single still running tracee, with
 pid = tgid = process id.
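The clean-up on PTRACE_EVENT_EXEC can be sketched as below. This assumes the tracer records the thread group id of every tracee, for example from the Tgid: line of /proc/<tid>/status when it first sees the thread. The record layout and function name are hypothetical. Every record of the exec'ing thread group is dropped except the one with pid == tgid, which from now on describes the tracee that returned from execve.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-tracee record. */
struct tracee {
    pid_t pid;   /* tid as used in waitpid()/ptrace() */
    pid_t tgid;  /* thread group id recorded by the tracer */
};

/* On PTRACE_EVENT_EXEC for process 'tgid': drop all records of that
 * thread group except pid == tgid. Compacts 'tab' in place and
 * returns the new number of records. */
size_t prune_after_exec(struct tracee *tab, size_t n, pid_t tgid)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].tgid == tgid && tab[i].pid != tgid)
            continue;            /* thread gone, or stale old tid */
        tab[kept++] = tab[i];
    }
    return kept;
}
```

If the tracer never had a record for pid == tgid (for example, the thread group leader was not traced by it), it would have to add one itself.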
 
-??? How tracer knows which of its many tracees _are_ threads of that
-particular process? (It may trace more than one process; it may even
-don't keep track of its tracees' thread group relations at all...)
-
-??? what happens if two threads execve at the same time? Clearly, only
-one of them succeeds, but *which* one? Think "strace -f" or
-multi-threaded process here:
+Currently, there is no way to retrieve the former pid of the execve-ing
+tracee. If the tracer doesn't keep track of its tracees' thread group
+relations, it may be unable to tell which tracee execve-ed and therefore
+no longer exists under its old pid because of the pid change.
+
+Example: two threads execve at the same time:
 
-  ** we get death notification: leader died: **
- PID0 exit(0)                            = ?
   ** we get syscall-entry-stop in thread 1: **
  PID1 execve("/bin/foo", "foo" <unfinished ...>
+  ** we issue PTRACE_SYSCALL for thread 1 **
   ** we get syscall-entry-stop in thread 2: **
  PID2 execve("/bin/bar", "bar" <unfinished ...>
+  ** we issue PTRACE_SYSCALL for thread 2 **
   ** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
   ** we get syscall-exit-stop for PID0: **
  PID0 <... execve resumed> )             = 0
 
-??? Question: WHICH execve succeeded? Can tracer figure it out?
+In this situation there is no way to know which execve succeeded.
 
 If PTRACE_O_TRACEEXEC option is NOT in effect for the execve'ing
 tracee, kernel delivers an extra SIGTRAP to tracee after execve syscall
@@ -523,16 +527,31 @@
 whole multi-threaded process exits). If they are the same process, the
 report is sent only once.
 
-- ??? add more docs
 
-Following bugs still exist:
+	1.x Known bugs
 
-- group-stop notifications are sent to tracer, but not to real parent.
+The following bugs still exist:
 
-- If thread group leader it is traced and exits, do_wait(WEXITED)
-doesn't work (until all threads exit) for its the tracer.
+Group-stop notifications are sent to tracer, but not to real parent.
+Last confirmed on 2.6.38.6.
 
-??? add more known bugs here
+If the thread group leader is traced and exits by calling the exit
+syscall, PTRACE_EVENT_EXIT stop will happen for it (if requested), but
+the subsequent WIFEXITED notification will not be delivered until all
+other threads exit. As explained above, if one of the other threads
+execve's, the thread group leader's death will *never* be reported. If
+the execve-ed thread is not traced by this tracer, the tracer will never
+know that execve happened.
+
+??? need to test this scenario
+
+One possible workaround is to detach the thread group leader instead of
+restarting it in this case. Last confirmed on 2.6.38.6.
+
+SIGKILL signal may still cause PTRACE_EVENT_EXIT stop before actual
+signal death. This may be changed in the future - SIGKILL is meant to
+always immediately kill tasks even under ptrace. Last confirmed on
+2.6.38.6.