Message-ID: <20250402122607.2982523-1-luis.machado@arm.com>
Date: Wed, 2 Apr 2025 13:26:07 +0100
From: Luis Machado <luis.machado@....com>
To: <linux-kernel@...r.kernel.org>, <peterz@...radead.org>,
	<dietmar.eggemann@....com>
CC: <mingo@...hat.com>, <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
	<rostedt@...dmis.org>, <bsegall@...gle.com>, <mgorman@...e.de>,
	<vschneid@...hat.com>
Subject: [RFC] sched: Improving scheduler debugging/tracing interfaces

This patch is primarily a proposal aimed at improving the observability of the
scheduler, especially in the context of energy-aware scheduling, without
introducing long-term maintenance burdens such as a stable userspace ABI. I’m
seeking feedback from the community on whether this approach seems viable, or
if there are suggestions for making it more robust or maintainable.

Today, validating that a set of scheduler changes behaves sanely and doesn’t
regress performance or energy metrics can be time-consuming. On the
energy-aware side in particular, this often requires significant manual
intervention to collect, post-process, and analyze data.

Another challenge is the limited availability of platforms that can run a
mainline kernel while still exposing the detailed data we need. While we do
have some options, most devices running upstream kernels don’t provide as
much — or as precise — information as we’d like. The most data-rich devices
tend to be phones or Android-based systems, which typically run slightly
older or patched kernels, adding yet another layer of complexity.

As a result, going from reviewing a patch series on LKML to having a concrete
good/bad/neutral result often involves several intermediate steps and tooling
hurdles.

Our current data collection relies heavily on existing kernel tracepoints and
trace events. However, adding new trace events is increasingly discouraged,
since these are often treated as part of a de facto userspace ABI — something
we want to avoid maintaining long-term. So extending the trace events set isn’t
a viable option.

To work around this, we use a kernel module (LISA) that defines its own trace
events based on existing scheduler tracepoints. This approach gives us
flexibility in creating events without modifying the kernel’s core trace
infrastructure or establishing any new userspace ABI.
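For reference, hooking one of these bare tracepoints from a module looks roughly
like the sketch below. This is illustrative only, not the actual LISA code: the
probe and module names are made up, and the prototype shown is the current
single-argument one generated by DECLARE_TRACE() in the kernel this is based on.

/*
 * Minimal sketch of a module attaching a probe to an existing bare
 * scheduler tracepoint and re-emitting the data. Not the actual LISA
 * module; error handling and the module-defined trace event are omitted.
 */
#include <linux/module.h>
#include <linux/tracepoint.h>
#include <trace/events/sched.h>

static void probe_pelt_cfs(void *data, struct cfs_rq *cfs_rq)
{
    /* struct cfs_rq is opaque here unless its layout is known (see below). */
    trace_printk("pelt_cfs_tp fired: cfs_rq=%p\n", cfs_rq);
}

static int __init sched_tp_init(void)
{
    /* register_trace_pelt_cfs_tp() is generated by DECLARE_TRACE(). */
    return register_trace_pelt_cfs_tp(probe_pelt_cfs, NULL);
}

static void __exit sched_tp_exit(void)
{
    unregister_trace_pelt_cfs_tp(probe_pelt_cfs, NULL);
    tracepoint_synchronize_unregister();
}

module_init(sched_tp_init);
module_exit(sched_tp_exit);
MODULE_LICENSE("GPL");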

For the past few years, tracepoint definitions for the scheduler have been
exposed in include/trace/events/sched.h. These definitions are not always
made available via tracefs, and are documented as being for testing and
debugging purposes — which aligns well with our use case.

However, this approach has limitations. One issue is the visibility of
tracepoint argument types. If a tracepoint uses a public type defined in a
public header, we can dereference members directly to extract data. But if
the type is internal or opaque — such as struct rq — we can’t access its
contents, which prevents us from retrieving useful values like the CPU number.

One workaround is to duplicate the kernel’s internal struct definitions in
the module, but this is fragile: it is error-prone due to layout and alignment
assumptions, and it requires constantly tracking kernel changes to avoid
mismatches.

A better approach, which we currently use, is to rely on BTF (BPF Type
Format) to reconstruct type information. BTF allows us to access internal
kernel types without having to maintain duplicate struct definitions. As long
as BTF info is available, we can introspect data structures even if they’re
not publicly defined.
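
To illustrate what that looks like in practice, the userspace sketch below uses
libbpf against /sys/kernel/btf/vmlinux to recover a struct member offset. It is
only an example of the kind of introspection we rely on, not our actual tooling;
the struct rq / cpu names simply mirror the example above.

/*
 * Userspace sketch: recover a struct member offset from the running
 * kernel's BTF using libbpf. Illustrative only; minimal error handling.
 */
#include <stdio.h>
#include <string.h>
#include <bpf/btf.h>

static long member_byte_offset(const struct btf *btf, const char *type_name,
                               const char *member_name)
{
    __s32 id = btf__find_by_name_kind(btf, type_name, BTF_KIND_STRUCT);
    const struct btf_type *t;
    const struct btf_member *m;
    int i;

    if (id < 0)
        return -1;

    t = btf__type_by_id(btf, id);
    m = btf_members(t);
    for (i = 0; i < btf_vlen(t); i++, m++) {
        const char *name = btf__name_by_offset(btf, m->name_off);

        if (name && !strcmp(name, member_name))
            return btf_member_bit_offset(t, i) / 8;
    }
    return -1;
}

int main(void)
{
    struct btf *btf = btf__load_vmlinux_btf();

    if (!btf)
        return 1;

    printf("offsetof(struct rq, cpu) = %ld\n",
           member_byte_offset(btf, "rq", "cpu"));
    btf__free(btf);
    return 0;
}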

Using this, our module can define trace events and dereference internal types
to extract data — but it’s not without friction:

- Struct members are often nested deeply within BTF type trees, which can make
it awkward to navigate and extract data.

- BTF describes data types, but not semantics. For example, sched_avg.util_est
appears to be a numeric value, but in reality it encodes a flag alongside the
actual utilization value. The kernel uses the following helper to extract the
actual data:

static inline unsigned long _task_util_est(struct task_struct *p)
{
    return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
}

There is no way to infer from BTF alone that this masking is needed. And even
when such helpers exist in the kernel, they’re often inlined or unavailable
to modules, so we’d have to reimplement them — again reintroducing
maintenance overhead.
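
To make the duplication concrete, a module-side reimplementation would look
something like the sketch below. The copied constant value is an assumption
about current kernels and is exactly the kind of detail that can silently
drift.

/*
 * Hypothetical module-side copy of _task_util_est(): UTIL_AVG_UNCHANGED
 * lives in kernel/sched/pelt.h, which modules cannot include, so its
 * value would have to be duplicated and tracked by hand. The value below
 * is an assumption based on current kernels.
 */
#define SCHED_TP_UTIL_AVG_UNCHANGED 0x80000000

static unsigned long sched_tp_task_util_est(unsigned int raw_util_est)
{
    /* Strip the "unchanged" flag to recover the utilization value. */
    return raw_util_est & ~SCHED_TP_UTIL_AVG_UNCHANGED;
}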

To address these challenges and reduce duplication, we propose adding an
extra argument to certain scheduler tracepoints: a pointer to a struct of
function pointers (callbacks). These callbacks would act as "getters" that
the module could use to fetch internal data in a safe, forward-compatible
way.

For example, to extract the CPU capacity from a struct rq (which is opaque to
the module), the module could call a getter function via the callback struct.
These functions would reside inside the kernel, and could leverage internal
knowledge, including inlined helpers and static data.

Here's an example of the proposed callback structure:

struct sched_tp_callbacks {
    /* Fetches the util_est from a cfs_rq. */
    unsigned int (*cfs_rq_util_est)(struct cfs_rq *cfs_rq);

    /* Fetches the util_est from a sched_entity. */
    unsigned int (*se_util_est)(struct sched_entity *se);

    /* Fetches the current CPU capacity from an rq. */
    unsigned long (*rq_cpu_current_capacity)(struct rq *rq);
};

The idea is simple: given a base type (e.g. rq, cfs_rq, sched_entity), the
module calls a getter function that returns the data it needs. These getters
encapsulate internal kernel logic and remove the need for the module to
replicate or guess how to access scheduler internals.
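
As an illustration, a module-side probe built on top of the proposed tracepoint
signature could then look like the sketch below. The probe name and the
trace_printk() are placeholders for whatever the module actually does with the
value.

/*
 * Sketch of a module-side probe using the proposed callbacks. The probe
 * name is illustrative; the signature matches the extended
 * sched_cpu_capacity_tp tracepoint from the patch below.
 */
static void probe_sched_cpu_capacity(void *data, struct rq *rq,
                                     struct sched_tp_callbacks *cbs)
{
    /* struct rq stays opaque; the kernel-side getter does the work. */
    unsigned long capacity = cbs->rq_cpu_current_capacity(rq);

    trace_printk("current cpu capacity: %lu\n", capacity);
}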

Since these additions would be part of tracepoints used for
testing/debugging, they are not considered stable ABI and can evolve as the
kernel changes. It would be up to the module to adapt to changes in available
hooks, types, or fields — something we already do today using BTF for
disappearing types (e.g. struct util_est becoming a raw integer).

While this approach would require some extra code in the kernel to define the
callback struct and register the functions, we believe it would significantly
improve the testability and maintainability of tooling like LISA. It could
also be extended to cover non-energy-aware scheduler debugging scenarios.

Our current testing pipeline already makes heavy use of LISA [1], which
automates test execution and data analysis. It also integrates with rt-app
[2] to generate configurable workloads.

The attached proof-of-concept patch adds three such callback functions as a
demonstration. We’ve tested this against a modified version of our module
that uses the callbacks to fetch scheduler internals.

Thoughts?

[1] https://tooling.sites.arm.com/lisa/latest/
[2] https://github.com/scheduler-tools/rt-app

Signed-off-by: Luis Machado <luis.machado@....com>
---
 include/trace/events/sched.h | 27 +++++++++++++++++----------
 kernel/sched/core.c          | 27 +++++++++++++++++++++++++++
 kernel/sched/fair.c          | 16 ++++++++--------
 kernel/sched/pelt.c          |  6 +++---
 kernel/sched/sched.h         | 11 +++++++++++
 5 files changed, 66 insertions(+), 21 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c..0687f4f62d9 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -767,6 +767,13 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
 	TP_printk("cpu=%d", __entry->cpu)
 );
 
+/* This struct is declared here so that the tracepoints below can
+ * pass it as a parameter.
+ * It is only used for testing and debugging, so tracepoint probes
+ * can use the callbacks to fetch the data they need.
+ */
+struct sched_tp_callbacks;
+
 /*
  * Following tracepoints are not exported in tracefs and provide hooking
  * mechanisms only for testing and debugging purposes.
@@ -774,8 +781,8 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
  * Postfixed with _tp to make them easily identifiable in the code.
  */
 DECLARE_TRACE(pelt_cfs_tp,
-	TP_PROTO(struct cfs_rq *cfs_rq),
-	TP_ARGS(cfs_rq));
+	TP_PROTO(struct cfs_rq *cfs_rq, struct sched_tp_callbacks *sched_tp_callbacks),
+	TP_ARGS(cfs_rq, sched_tp_callbacks));
 
 DECLARE_TRACE(pelt_rt_tp,
 	TP_PROTO(struct rq *rq),
@@ -794,24 +801,24 @@ DECLARE_TRACE(pelt_irq_tp,
 	TP_ARGS(rq));
 
 DECLARE_TRACE(pelt_se_tp,
-	TP_PROTO(struct sched_entity *se),
-	TP_ARGS(se));
+	TP_PROTO(struct sched_entity *se, struct sched_tp_callbacks *sched_tp_callbacks),
+	TP_ARGS(se, sched_tp_callbacks));
 
 DECLARE_TRACE(sched_cpu_capacity_tp,
-	TP_PROTO(struct rq *rq),
-	TP_ARGS(rq));
+	TP_PROTO(struct rq *rq, struct sched_tp_callbacks *sched_tp_callbacks),
+	TP_ARGS(rq, sched_tp_callbacks));
 
 DECLARE_TRACE(sched_overutilized_tp,
 	TP_PROTO(struct root_domain *rd, bool overutilized),
 	TP_ARGS(rd, overutilized));
 
 DECLARE_TRACE(sched_util_est_cfs_tp,
-	TP_PROTO(struct cfs_rq *cfs_rq),
-	TP_ARGS(cfs_rq));
+	TP_PROTO(struct cfs_rq *cfs_rq, struct sched_tp_callbacks *sched_tp_callbacks),
+	TP_ARGS(cfs_rq, sched_tp_callbacks));
 
 DECLARE_TRACE(sched_util_est_se_tp,
-	TP_PROTO(struct sched_entity *se),
-	TP_ARGS(se));
+	TP_PROTO(struct sched_entity *se, struct sched_tp_callbacks *sched_tp_callbacks),
+	TP_ARGS(se, sched_tp_callbacks));
 
 DECLARE_TRACE(sched_update_nr_running_tp,
 	TP_PROTO(struct rq *rq, int change),
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c81cf642dba..a0ee7534aaa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -99,6 +99,27 @@
 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu);
 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask);
 
+static unsigned int cfs_rq_util_est(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq ? cfs_rq->avg.util_est : 0;
+}
+
+static unsigned int se_util_est(struct sched_entity *se)
+{
+
+	return se ? se->avg.util_est & ~UTIL_AVG_UNCHANGED : 0;
+}
+
+static unsigned long rq_cpu_current_capacity(struct rq *rq)
+{
+	if (rq == NULL)
+		return 0;
+
+	unsigned long capacity_orig = per_cpu(cpu_scale, rq->cpu);
+	unsigned long scale_freq = per_cpu(arch_freq_scale, rq->cpu);
+	return cap_scale(capacity_orig, scale_freq);
+}
+
 /*
  * Export tracepoints that act as a bare tracehook (ie: have no trace event
  * associated with them) to allow external modules to probe them.
@@ -8524,11 +8545,17 @@ LIST_HEAD(task_groups);
 static struct kmem_cache *task_group_cache __ro_after_init;
 #endif
 
+struct sched_tp_callbacks sched_tp_callbacks;
+
 void __init sched_init(void)
 {
 	unsigned long ptr = 0;
 	int i;
 
+	sched_tp_callbacks.cfs_rq_util_est = cfs_rq_util_est;
+	sched_tp_callbacks.se_util_est = se_util_est;
+	sched_tp_callbacks.rq_cpu_current_capacity = rq_cpu_current_capacity;
+
 	/* Make sure the linker didn't screw up */
 #ifdef CONFIG_SMP
 	BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e58..f115842669b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4433,8 +4433,8 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
 	update_tg_cfs_load(cfs_rq, se, gcfs_rq);
 
-	trace_pelt_cfs_tp(cfs_rq);
-	trace_pelt_se_tp(se);
+	trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
+	trace_pelt_se_tp(se, &sched_tp_callbacks);
 
 	return 1;
 }
@@ -4698,7 +4698,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	cfs_rq_util_change(cfs_rq, 0);
 
-	trace_pelt_cfs_tp(cfs_rq);
+	trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
 }
 
 /**
@@ -4728,7 +4728,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	cfs_rq_util_change(cfs_rq, 0);
 
-	trace_pelt_cfs_tp(cfs_rq);
+	trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
 }
 
 /*
@@ -4865,7 +4865,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
 	enqueued += _task_util_est(p);
 	WRITE_ONCE(cfs_rq->avg.util_est, enqueued);
 
-	trace_sched_util_est_cfs_tp(cfs_rq);
+	trace_sched_util_est_cfs_tp(cfs_rq, &sched_tp_callbacks);
 }
 
 static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
@@ -4881,7 +4881,7 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
 	enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
 	WRITE_ONCE(cfs_rq->avg.util_est, enqueued);
 
-	trace_sched_util_est_cfs_tp(cfs_rq);
+	trace_sched_util_est_cfs_tp(cfs_rq, &sched_tp_callbacks);
 }
 
 #define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
@@ -4970,7 +4970,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	ewma |= UTIL_AVG_UNCHANGED;
 	WRITE_ONCE(p->se.avg.util_est, ewma);
 
-	trace_sched_util_est_se_tp(&p->se);
+	trace_sched_util_est_se_tp(&p->se, &sched_tp_callbacks);
 }
 
 static inline unsigned long get_actual_cpu_capacity(int cpu)
@@ -10009,7 +10009,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 		capacity = 1;
 
 	cpu_rq(cpu)->cpu_capacity = capacity;
-	trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+	trace_sched_cpu_capacity_tp(cpu_rq(cpu), &sched_tp_callbacks);
 
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 7a8534a2def..7ca37abf46b 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -296,7 +296,7 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
 	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
 		___update_load_avg(&se->avg, se_weight(se));
-		trace_pelt_se_tp(se);
+		trace_pelt_se_tp(se, &sched_tp_callbacks);
 		return 1;
 	}
 
@@ -310,7 +310,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
 
 		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
-		trace_pelt_se_tp(se);
+		trace_pelt_se_tp(se, &sched_tp_callbacks);
 		return 1;
 	}
 
@@ -325,7 +325,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 				cfs_rq->curr != NULL)) {
 
 		___update_load_avg(&cfs_rq->avg, 1);
-		trace_pelt_cfs_tp(cfs_rq);
+		trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
 		return 1;
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34ea7..5b0b8bb460c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -182,6 +182,17 @@ extern struct list_head asym_cap_list;
  */
 #define RUNTIME_INF		((u64)~0ULL)
 
+struct sched_tp_callbacks {
+	/* Fetches the util_est from a cfs_rq. */
+	unsigned int (*cfs_rq_util_est)(struct cfs_rq *cfs_rq);
+	/* Fetches the util_est from a sched_entity. */
+	unsigned int (*se_util_est)(struct sched_entity *se);
+	/* Fetches the current CPU capacity from an rq. */
+	unsigned long (*rq_cpu_current_capacity)(struct rq *rq);
+};
+
+extern struct sched_tp_callbacks sched_tp_callbacks;
+
 static inline int idle_policy(int policy)
 {
 	return policy == SCHED_IDLE;

base-commit: 328802738e1cd091d04076317f3c2174125c5916
-- 
2.25.1

