[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87bkulgb7x.fsf@email.froward.int.ebiederm.org>
Date: Tue, 21 Jun 2022 12:47:30 -0500
From: "Eric W. Biederman" <ebiederm@...ssion.com>
To: Alexander Gordeev <agordeev@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org, rjw@...ysocki.net,
Oleg Nesterov <oleg@...hat.com>, mingo@...nel.org,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, mgorman@...e.de, bigeasy@...utronix.de,
Will Deacon <will@...nel.org>, tj@...nel.org,
linux-pm@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
Richard Weinberger <richard@....at>,
Anton Ivanov <anton.ivanov@...bridgegreys.com>,
Johannes Berg <johannes@...solutions.net>,
linux-um@...ts.infradead.org, Chris Zankel <chris@...kel.net>,
Max Filippov <jcmvbkbc@...il.com>,
linux-xtensa@...ux-xtensa.org, Kees Cook <keescook@...omium.org>,
Jann Horn <jannh@...gle.com>, linux-ia64@...r.kernel.org
Subject: Re: [PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED,
TASK_STOPPED state
Alexander Gordeev <agordeev@...ux.ibm.com> writes:
> On Tue, Jun 21, 2022 at 09:02:05AM -0500, Eric W. Biederman wrote:
>> Alexander Gordeev <agordeev@...ux.ibm.com> writes:
>>
>> > On Thu, May 05, 2022 at 01:26:45PM -0500, Eric W. Biederman wrote:
>> >> From: Peter Zijlstra <peterz@...radead.org>
>> >>
>> >> Currently ptrace_stop() / do_signal_stop() rely on the special states
>> >> TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
>> >> state exists only in task->__state and nowhere else.
>> >>
>> >> There's two spots of bother with this:
>> >>
>> >> - PREEMPT_RT has task->saved_state which complicates matters,
>> >> meaning task_is_{traced,stopped}() needs to check an additional
>> >> variable.
>> >>
>> >> - An alternative freezer implementation that itself relies on a
>> >> special TASK state would loose TASK_TRACED/TASK_STOPPED and will
>> >> result in misbehaviour.
>> >>
>> >> As such, add additional state to task->jobctl to track this state
>> >> outside of task->__state.
>> >>
>> >> NOTE: this doesn't actually fix anything yet, just adds extra state.
>> >>
>> >> --EWB
>> >> * didn't add a unnecessary newline in signal.h
>> >> * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
>> >> instead of in signal_wake_up_state. This prevents the clearing
>> >> of TASK_STOPPED and TASK_TRACED from getting lost.
>> >> * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared
>> >
>> > Hi Eric, Peter,
>> >
>> > On s390 this patch triggers warning at kernel/ptrace.c:272 when
>> > kill_child testcase from strace tool is repeatedly used (the source
>> > is attached for reference):
>> >
>> > while :; do
>> > strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child
>> > done
>> >
>> > It normally takes few minutes to cause the warning in -rc3, but FWIW
>> > it hits almost immediately for ptrace_stop-cleanup-for-v5.19 tag of
>> > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.
>> >
>> > Commit 7b0fe1367ef2 ("ptrace: Document that wait_task_inactive can't
>> > fail") suggests this WARN_ON_ONCE() is not really expected, yet we
>> > observe a child in __TASK_TRACED state. Could you please comment here?
>> >
>>
>> For clarity the warning is that the child is not in __TASK_TRACED state.
>>
>> The code is waiting for the code to stop in the scheduler in the
>> __TASK_TRACED state so that it can safely read and change the
>> processes state. Some of that state is not even saved until the
>> process is scheduled out so we have to wait until the process
>> is stopped in the scheduler.
>
> So I assume (checked actually) the return 0 below from kernel/sched/core.c:
> wait_task_inactive() is where it bails out:
>
> 3303 while (task_running(rq, p)) {
> 3304 if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> 3305 return 0;
> 3306 cpu_relax();
> 3307 }
>
> Yet, the child task is always found in __TASK_TRACED state (as seen
> in crash dumps):
>
>> 101447 11342 13 ce3a8100 RU 0.0 10040 4412 strace
> 101450 101447 0 bb04b200 TR 0.0 2272 1136 kill_child
> 108261 101447 2 d0b10100 TR 0.0 2272 532 kill_child
> crash> task bb04b200 __state
> PID: 101450 TASK: bb04b200 CPU: 0 COMMAND: "kill_child"
> __state = 8,
>
> crash> task d0b10100 __state
> PID: 108261 TASK: d0b10100 CPU: 2 COMMAND: "kill_child"
> __state = 8,
That is weird.
>> At least on s390 it looks like there is a race between SIGKILL and
>> ptrace_check_attach. That isn't good.
>>
>> Reading the code below there is something missing because I don't see
>> anything making ptrace calls, and ptrace_check_attach (which contains
>> the warning) only happens in the ptrace syscall.
>
> That is what I believe strace does when calling that code:
>
> strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child
Thank you. That was my braino.
I will have to see if it reproduces for me on x86 (I don't have an
s390). Perhaps if I can reproduce it I can guess what is going wrong.
So far it appears WARN_ON_ONCE has nothing to warn about yet it is
warning.
Eric
Powered by blists - more mailing lists