linux-kernel - Re: [PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED, TASK

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YrtKReO2vIiX8VVU@tuxmaker.boeblingen.de.ibm.com>
Date:   Tue, 28 Jun 2022 20:36:53 +0200
From:   Alexander Gordeev <agordeev@...ux.ibm.com>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     linux-kernel@...r.kernel.org, rjw@...ysocki.net,
        Oleg Nesterov <oleg@...hat.com>, mingo@...nel.org,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, mgorman@...e.de, bigeasy@...utronix.de,
        Will Deacon <will@...nel.org>, tj@...nel.org,
        linux-pm@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
        Richard Weinberger <richard@....at>,
        Anton Ivanov <anton.ivanov@...bridgegreys.com>,
        Johannes Berg <johannes@...solutions.net>,
        linux-um@...ts.infradead.org, Chris Zankel <chris@...kel.net>,
        Max Filippov <jcmvbkbc@...il.com>,
        linux-xtensa@...ux-xtensa.org, Kees Cook <keescook@...omium.org>,
        Jann Horn <jannh@...gle.com>, linux-ia64@...r.kernel.org
Subject: Re: [PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED,
 TASK_STOPPED state

On Sat, Jun 25, 2022 at 11:34:46AM -0500, Eric W. Biederman wrote:
> I haven't gotten as far as reproducing this but I have started giving
> this issue some thought.
> 
> This entire thing smells like a memory barrier is missing somewhere.
> However by definition the lock implementations in linux provide all the
> needed memory barriers, and in the ptrace_stop and ptrace_check_attach
> path I don't see cases where these values are sampled outside of a lock
> except in wait_task_inactive.  Does doing that perhaps require a
> barrier? 
> 
> The two things I can think of that could shed light on what is going on
> is enabling lockdep, to enable the debug check in signal_wake_up_state
> and verifying bits of state that should be constant while the task
> is frozen for ptrace are indeed constant when task is frozen for ptrace.
> Something like my patch below.
> 
> If you could test that when you have a chance that would help narrow
> down what is going on.
> 
> Thank you,
> Eric
> 
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 156a99283b11..6467a2b1c3bc 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -268,9 +268,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  	}
>  	read_unlock(&tasklist_lock);
>  
> -	if (!ret && !ignore_state &&
> -	    WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED)))
> +	if (!ret && !ignore_state) {
> +		WARN_ON_ONCE(!(child->jobctl & JOBCTL_PTRACE_FROZEN));
> +		WARN_ON_ONCE(!(child->joctctl & JOBCTL_TRACED));
> +		WARN_ON_ONCE(READ_ONCE(child->__state) != __TASK_TRACED);
> +		WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED));
>  		ret = -ESRCH;
> +	}
>  
>  	return ret;
>  }

I modified your chunk a bit - hope that is what you had in mind:

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 156a99283b11..f0e9a9a4d63c 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -268,9 +268,19 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	}
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state &&
-	    WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED)))
-		ret = -ESRCH;
+	if (!ret && !ignore_state) {
+		unsigned int __state;
+
+		WARN_ON_ONCE(!(child->jobctl & JOBCTL_PTRACE_FROZEN));
+		WARN_ON_ONCE(!(child->jobctl & JOBCTL_TRACED));
+		__state = READ_ONCE(child->__state);
+		if (__state != __TASK_TRACED) {
+			pr_err("%s(%d) __state %x", __FUNCTION__, __LINE__, __state);
+			WARN_ON_ONCE(1);
+		}
+		if (WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED)))
+			ret = -ESRCH;
+	}
 
 	return ret;
 }


When WARN_ON_ONCE(1) hits the child __state is always zero/TASK_RUNNING,
as reported by the preceding pr_err(). Yet, in the resulting core dump
it is always __TASK_TRACED.

Removing WARN_ON_ONCE(1) while looping until (__state != __TASK_TRACED)
confirms the unexpected __state is always TASK_RUNNING. It never observed
more than one iteration and gets printed once in 30-60 mins.

So probably when the condition is entered __state is TASK_RUNNING more
often, but gets overwritten with __TASK_TRACED pretty quickly. Which kind
of consistent with my previous observation that kernel/sched/core.c:3305
is where return 0 makes wait_task_inactive() fail.

No other WARN_ON_ONCE() hit ever.