[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250402180925.90914-1-andrii@kernel.org>
Date: Wed, 2 Apr 2025 11:09:25 -0700
From: Andrii Nakryiko <andrii@...nel.org>
To: linux-trace-kernel@...r.kernel.org,
peterz@...radead.org,
mingo@...nel.org
Cc: bpf@...r.kernel.org,
linux-kernel@...r.kernel.org,
kernel-team@...a.com,
mhocko@...nel.org,
rostedt@...dmis.org,
oleg@...hat.com,
brauner@...nel.org,
glider@...gle.com,
mhiramat@...nel.org,
mathieu.desnoyers@...icios.com,
akpm@...ux-foundation.org,
Andrii Nakryiko <andrii@...nel.org>
Subject: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
It is useful to be able to access current->mm at task exit to, say,
record a bunch of VMA information right before the task exits (e.g., for
stack symbolization reasons when dealing with short-lived processes that
exit in the middle of profiling session). Currently,
trace_sched_process_exit() is triggered after exit_mm() which resets
current->mm to NULL making this tracepoint unsuitable for inspecting
and recording task's mm_struct-related data when tracing process
lifetimes.
There is a particularly suitable place, though, right after
taskstats_exit() is called, but before we do exit_mm() and other
exit_*() resource teardowns. taskstats performs a similar kind of
accounting that some applications do with BPF, and so co-locating them
seems like a good fit. So that's where trace_sched_process_exit() is
moved with this patch.
Also, existing trace_sched_process_exit() tracepoint is notoriously
missing `group_dead` flag that is certainly useful in practice and some
of our production applications have to work around this. So plumb
`group_dead` through while at it, to have a richer and more complete
tracepoint.
Note that we can't use sched_process_template anymore, and so we use
TRACE_EVENT()-based tracepoint definition. But all the field names and
order, as well as assign and output logic remain intact. We just add one
extra field at the end in backwards-compatible way.
Signed-off-by: Andrii Nakryiko <andrii@...nel.org>
---
include/trace/events/sched.h | 28 +++++++++++++++++++++++++---
kernel/exit.c | 2 +-
2 files changed, 26 insertions(+), 4 deletions(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c1..05a14f2b35c3 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -328,9 +328,31 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
/*
* Tracepoint for a task exiting:
*/
-DEFINE_EVENT(sched_process_template, sched_process_exit,
- TP_PROTO(struct task_struct *p),
- TP_ARGS(p));
+TRACE_EVENT(sched_process_exit,
+
+ TP_PROTO(struct task_struct *p, bool group_dead),
+
+ TP_ARGS(p, group_dead),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( int, prio )
+ __field( bool, group_dead )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+ __entry->pid = p->pid;
+ __entry->prio = p->prio; /* XXX SCHED_DEADLINE */
+ __entry->group_dead = group_dead;
+ ),
+
+ TP_printk("comm=%s pid=%d prio=%d group_dead=%s",
+ __entry->comm, __entry->pid, __entry->prio,
+ __entry->group_dead ? "true" : "false"
+ )
+);
/*
* Tracepoint for waiting on task to unschedule:
diff --git a/kernel/exit.c b/kernel/exit.c
index c2e6c7b7779f..4abd307b1586 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -937,12 +937,12 @@ void __noreturn do_exit(long code)
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);
+ trace_sched_process_exit(tsk, group_dead);
exit_mm();
if (group_dead)
acct_process();
- trace_sched_process_exit(tsk);
exit_sem(tsk);
exit_shm(tsk);
--
2.47.1
Powered by blists - more mailing lists