linux-kernel - Re: [PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED, TASK

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YrHgo8GKFPWwoBoJ@li-4a3a4a4c-28e5-11b2-a85c-a8d192c6f089.ibm.com>
Date:   Tue, 21 Jun 2022 17:15:47 +0200
From:   Alexander Gordeev <agordeev@...ux.ibm.com>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     linux-kernel@...r.kernel.org, rjw@...ysocki.net,
        Oleg Nesterov <oleg@...hat.com>, mingo@...nel.org,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, mgorman@...e.de, bigeasy@...utronix.de,
        Will Deacon <will@...nel.org>, tj@...nel.org,
        linux-pm@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
        Richard Weinberger <richard@....at>,
        Anton Ivanov <anton.ivanov@...bridgegreys.com>,
        Johannes Berg <johannes@...solutions.net>,
        linux-um@...ts.infradead.org, Chris Zankel <chris@...kel.net>,
        Max Filippov <jcmvbkbc@...il.com>,
        linux-xtensa@...ux-xtensa.org, Kees Cook <keescook@...omium.org>,
        Jann Horn <jannh@...gle.com>, linux-ia64@...r.kernel.org
Subject: Re: [PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED,
 TASK_STOPPED state

On Tue, Jun 21, 2022 at 09:02:05AM -0500, Eric W. Biederman wrote:
> Alexander Gordeev <agordeev@...ux.ibm.com> writes:
> 
> > On Thu, May 05, 2022 at 01:26:45PM -0500, Eric W. Biederman wrote:
> >> From: Peter Zijlstra <peterz@...radead.org>
> >> 
> >> Currently ptrace_stop() / do_signal_stop() rely on the special states
> >> TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
> >> state exists only in task->__state and nowhere else.
> >> 
> >> There's two spots of bother with this:
> >> 
> >>  - PREEMPT_RT has task->saved_state which complicates matters,
> >>    meaning task_is_{traced,stopped}() needs to check an additional
> >>    variable.
> >> 
> >>  - An alternative freezer implementation that itself relies on a
> >>    special TASK state would loose TASK_TRACED/TASK_STOPPED and will
> >>    result in misbehaviour.
> >> 
> >> As such, add additional state to task->jobctl to track this state
> >> outside of task->__state.
> >> 
> >> NOTE: this doesn't actually fix anything yet, just adds extra state.
> >> 
> >> --EWB
> >>   * didn't add a unnecessary newline in signal.h
> >>   * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
> >>     instead of in signal_wake_up_state.  This prevents the clearing
> >>     of TASK_STOPPED and TASK_TRACED from getting lost.
> >>   * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared
> >
> > Hi Eric, Peter,
> >
> > On s390 this patch triggers warning at kernel/ptrace.c:272 when
> > kill_child testcase from strace tool is repeatedly used (the source
> > is attached for reference):
> >
> > while :; do
> > 	strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child
> > done
> >
> > It normally takes few minutes to cause the warning in -rc3, but FWIW
> > it hits almost immediately for ptrace_stop-cleanup-for-v5.19 tag of
> > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.
> >
> > Commit 7b0fe1367ef2 ("ptrace: Document that wait_task_inactive can't
> > fail") suggests this WARN_ON_ONCE() is not really expected, yet we
> > observe a child in __TASK_TRACED state. Could you please comment here?
> >
> 
> For clarity the warning is that the child is not in __TASK_TRACED state.
> 
> The code is waiting for the code to stop in the scheduler in the
> __TASK_TRACED state so that it can safely read and change the
> processes state.  Some of that state is not even saved until the
> process is scheduled out so we have to wait until the process
> is stopped in the scheduler.

So I assume (checked actually) the return 0 below from kernel/sched/core.c:
wait_task_inactive() is where it bails out:

3303                 while (task_running(rq, p)) {
3304                         if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
3305                                 return 0;
3306                         cpu_relax();
3307                 }

Yet, the child task is always found in __TASK_TRACED state (as seen
in crash dumps):

> 101447  11342  13      ce3a8100      RU   0.0   10040   4412  strace
  101450  101447   0      bb04b200      TR   0.0    2272   1136  kill_child
  108261  101447   2      d0b10100      TR   0.0    2272    532  kill_child
crash> task bb04b200 __state
PID: 101450  TASK: bb04b200          CPU: 0   COMMAND: "kill_child"
  __state = 8,

crash> task d0b10100 __state
PID: 108261  TASK: d0b10100          CPU: 2   COMMAND: "kill_child"
  __state = 8,

> At least on s390 it looks like there is a race between SIGKILL and
> ptrace_check_attach.  That isn't good.
>
> Reading the code below there is something missing because I don't see
> anything making ptrace calls, and ptrace_check_attach (which contains
> the warning) only happens in the ptrace syscall.

That is what I believe strace does when calling that code:

 	strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child

> Eric
> 
> 
> 
> > Thanks!
> >
> > /*
> >  * Check for the corner case that previously lead to segfault
> >  * due to an attempt to access unitialised tcp->s_ent.
> >  *
> >  * 13994 ????( <unfinished ...>
> >  * ...
> >  * 13994 <... ???? resumed>) = ?
> >  *
> >  * Copyright (c) 2019 The strace developers.
> >  * All rights reserved.
> >  *
> >  * SPDX-License-Identifier: GPL-2.0-or-later
> >  */
> >
> > #include "tests.h"
> >
> > #include <sched.h>
> > #include <signal.h>
> > #include <unistd.h>
> > #include <sys/mman.h>
> > #include <sys/wait.h>
> >
> > #define ITERS    10000
> > #define SC_ITERS 10000
> >
> > int
> > main(void)
> > {
> > 	volatile sig_atomic_t *const mem =
> > 		mmap(NULL, get_page_size(), PROT_READ | PROT_WRITE,
> > 		     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> > 	if (mem == MAP_FAILED)
> > 		perror_msg_and_fail("mmap");
> >
> > 	for (unsigned int i = 0; i < ITERS; ++i) {
> > 		mem[0] = mem[1] = 0;
> >
> > 		const pid_t pid = fork();
> > 		if (pid < 0)
> > 			perror_msg_and_fail("fork");
> >
> > 		if (!pid) {
> > 			/* wait for the parent */
> > 			while (!mem[0])
> > 				;
> > 			/* let the parent know we are running */
> > 			mem[1] = 1;
> >
> > 			for (unsigned int j = 0; j < SC_ITERS; j++)
> > 				sched_yield();
> >
> > 			pause();
> > 			return 0;
> > 		}
> >
> > 		/* let the child know we are running */
> > 		mem[0] = 1;
> > 		/* wait for the child */
> > 		while (!mem[1])
> > 			;
> >
> > 		if (kill(pid, SIGKILL))
> > 			perror_msg_and_fail("kill");
> > 		if (wait(NULL) != pid)
> > 			perror_msg_and_fail("wait");
> > 	}
> >
> > 	return 0;
> > }