[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20220117164633.322550-1-valentin.schneider@arm.com>
Date: Mon, 17 Jan 2022 16:46:31 +0000
From: Valentin Schneider <valentin.schneider@....com>
To: linux-kernel@...r.kernel.org
Cc: Abhijeet Dharmapurikar <adharmap@...cinc.com>,
Uwe Kleine-König
<u.kleine-koenig@...gutronix.de>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Thomas Gleixner <tglx@...utronix.de>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Juri Lelli <juri.lelli@...hat.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Kees Cook <keescook@...omium.org>,
Andrew Morton <akpm@...ux-foundation.org>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Alexey Gladkov <legion@...nel.org>,
"Kenta.Tada@...y.com" <Kenta.Tada@...y.com>,
Randy Dunlap <rdunlap@...radead.org>,
Ed Tsai <ed.tsai@...iatek.com>
Subject: [PATCH v2 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not
Hi folks,
Problem
=======
Abhijeet pointed out that the following sequence of trace events can be
observed:
cat-1676 [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
<idle>-0 [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004
IOW, not-yet-woken task gets presented as runnable and switched out in
favor of the idle task... Dietmar and I had a look, turns out this sequence
can happen:
p->__state = TASK_INTERRUPTIBLE;
__schedule()
deactivate_task(p);
ttwu()
READ !p->on_rq
p->__state=TASK_WAKING
trace_sched_switch()
__trace_sched_switch_state()
task_state_index()
return 0;
TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0.
This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") which punted the TASK_WAKING write to the other
side of
smp_cond_load_acquire(&p->on_cpu, !VAL);
in ttwu().
Uwe reported on #linux-rt what I think is a similar issue with
TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
you get prev_state=0 despite the task blocking on a lock.
Both of those are very confusing for tooling or even human observers.
Patches
=======
For the TASK_WAKING case, pushing the prev_state read in __schedule() down
to __trace_sched_switch_state() seems sufficient. That's patch 1.
For TASK_RTLOCK_WAIT, it's a bit trickier. One could use ->saved_state as
prev_state, but
a) that is also susceptible to racing (see ttwu() / ttwu_state_match())
b) some lock substitutions (e.g. rt_spin_lock()) leave ->saved_state as
TASK_RUNNING.
Patch 2 adds TASK_RTLOCK_WAIT to TASK_REPORT instead.
Testing
=======
Briefly tested on an Arm Juno via ftrace+hackbench against
o tip/sched/core@...62d2af31a
o v5.16-rt15-rebase w/ CONFIG_PREEMPT_RT
I also had a look at the __schedule() disassembly as suggested by Peter; on x86
prev_state gets pushed to the stack and popped prior to the trace event static
call, the rest seems mostly unchanged (GCC 9.3).
Revisions
=========
RFC v1 -> v2
++++++++++++
o Collected tags (Steven, Sebastian)
o Patched missing trace_switch event clients (Steven)
Cheers,
Valentin
Valentin Schneider (2):
sched/tracing: Don't re-read p->state when emitting sched_switch event
sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT
fs/proc/array.c | 3 ++-
include/linux/sched.h | 28 +++++++++++++++++-----------
include/trace/events/sched.h | 12 ++++++++----
kernel/sched/core.c | 4 ++--
kernel/trace/fgraph.c | 4 +++-
kernel/trace/ftrace.c | 4 +++-
kernel/trace/trace_events.c | 8 ++++++--
kernel/trace/trace_osnoise.c | 4 +++-
kernel/trace/trace_sched_switch.c | 1 +
kernel/trace/trace_sched_wakeup.c | 1 +
10 files changed, 46 insertions(+), 23 deletions(-)
--
2.25.1
Powered by blists - more mailing lists