Message-ID: <725629dc-b1a4-42d5-81c6-ae09dc5f9444@arm.com>
Date: Wed, 2 Apr 2025 16:22:11 +0100
From: Luis Machado <luis.machado@....com>
To: linux-kernel@...r.kernel.org, peterz@...radead.org,
dietmar.eggemann@....com
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com
Subject: Re: [RFC] sched: Improving scheduler debugging/tracing interfaces
On 4/2/25 13:26, Luis Machado wrote:
> This patch is primarily a proposal aimed at improving the observability of the
> scheduler, especially in the context of energy-aware scheduling, without
> introducing long-term maintenance burdens such as a stable userspace ABI. I’m
> seeking feedback from the community on whether this approach seems viable, or
> if there are suggestions for making it more robust or maintainable.
>
> Today, validating that a set of scheduler changes behaves sanely and doesn’t
> regress performance or energy metrics can be time-consuming. On the
> energy-aware side in particular, this often requires significant manual
> intervention to collect, post-process, and analyze data.
>
> Another challenge is the limited availability of platforms that can run a
> mainline kernel while still exposing the detailed data we need. While we do
> have some options, most devices running upstream kernels don’t provide as
> much — or as precise — information as we’d like. The most data-rich devices
> tend to be phones or Android-based systems, which typically run slightly
> older or patched kernels, adding yet another layer of complexity.
>
> As a result, going from reviewing a patch series on LKML to having a concrete
> good/bad/neutral result often involves several intermediate steps and tooling
> hurdles.
>
> Our current data collection relies heavily on existing kernel tracepoints and
> trace events. However, adding new trace events is increasingly discouraged,
> since these are often treated as part of a de facto userspace ABI — something
> we want to avoid maintaining long-term. So extending the trace events set isn’t
> a viable option.
>
> To work around this, we use a kernel module (LISA) that defines its own trace
> events based on existing scheduler tracepoints. This approach gives us
> flexibility in creating events without modifying the kernel’s core trace
> infrastructure or establishing any new userspace ABI.
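>
> As a rough sketch (simplified, with module boilerplate and error handling
> left out, and function names made up for illustration), a probe attached to
> one of the existing bare scheduler tracepoints looks something like this:
>
> static void probe_pelt_cfs(void *data, struct cfs_rq *cfs_rq)
> {
>         /* Emit the module's own trace event from here. */
> }
>
> static int __init sched_tp_module_init(void)
> {
>         return register_trace_pelt_cfs_tp(probe_pelt_cfs, NULL);
> }
>
> static void __exit sched_tp_module_exit(void)
> {
>         unregister_trace_pelt_cfs_tp(probe_pelt_cfs, NULL);
>         tracepoint_synchronize_unregister();
> }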
>
> For the past few years, scheduler tracepoint definitions have been exposed
> in include/trace/events/sched.h. Some of these are bare tracepoints that are
> not made available via tracefs and are documented as being for testing and
> debugging purposes — which aligns well with our use case.
>
> However, this approach has limitations. One issue is the visibility of
> tracepoint argument types. If a tracepoint uses a public type defined in a
> public header, we can dereference members directly to extract data. But if
> the type is internal or opaque — such as struct rq — we can’t access its
> contents, which prevents us from retrieving useful values like the CPU number.
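>
> For example, a probe attached to sched_cpu_capacity_tp receives a struct rq
> pointer, but from the module's point of view struct rq is only a forward
> declaration, so something as simple as the following (probe name made up)
> does not even build:
>
> static void probe_cpu_capacity(void *data, struct rq *rq)
> {
>         /* Build error: struct rq is an incomplete type here. */
>         int cpu = rq->cpu;
> }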
>
> One workaround is to duplicate the kernel’s internal struct definitions in
> the module, but that is fragile: it’s error-prone due to layout and alignment
> mismatches, and it requires constantly tracking kernel changes to stay in sync.
>
> A better approach, which we currently use, is to rely on BTF (BPF Type
> Format) to reconstruct type information. BTF allows us to access internal
> kernel types without having to maintain duplicate struct definitions. As long
> as BTF info is available, we can introspect data structures even if they’re
> not publicly defined.
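>
> Just to illustrate the kind of introspection BTF enables (this is a
> standalone userspace sketch using libbpf, not how the module itself
> consumes BTF), the layout of an internal type can be read straight out
> of the vmlinux BTF:
>
> #include <stdio.h>
> #include <string.h>
> #include <bpf/btf.h>
>
> int main(void)
> {
>         const struct btf_type *t;
>         const struct btf_member *m;
>         struct btf *btf;
>         __s32 id;
>         int i;
>
>         btf = btf__load_vmlinux_btf();
>         if (!btf)
>                 return 1;
>
>         id = btf__find_by_name_kind(btf, "rq", BTF_KIND_STRUCT);
>         if (id < 0)
>                 return 1;
>
>         t = btf__type_by_id(btf, id);
>         m = btf_members(t);
>
>         for (i = 0; i < btf_vlen(t); i++, m++) {
>                 const char *name = btf__name_by_offset(btf, m->name_off);
>
>                 /* Print the byte offset of rq->cpu, an internal field. */
>                 if (name && !strcmp(name, "cpu"))
>                         printf("rq.cpu at byte offset %u\n",
>                                btf_member_bit_offset(t, i) / 8);
>         }
>
>         btf__free(btf);
>         return 0;
> }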
>
> Using this, our module can define trace events and dereference internal types
> to extract data — but it’s not without friction:
>
> - Struct members are often nested deeply within BTF type trees, which can make
> it awkward to navigate and extract data.
>
> - BTF describes data types, but not semantics. For example, sched_avg.util_est
> appears to be a numeric value, but in reality it encodes a flag alongside the
> actual utilization value. The kernel uses the following helper to extract the
> actual data:
>
> static inline unsigned long _task_util_est(struct task_struct *p)
> {
>         return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> }
>
> There is no way to infer from BTF alone that this masking is needed. And even
> when such helpers exist in the kernel, they’re often inlined or unavailable
> to modules, so we’d have to reimplement them — reintroducing the same
> maintenance overhead.
>
> To address these challenges and reduce duplication, we propose adding an
> extra argument to certain scheduler tracepoints: a pointer to a struct of
> function pointers (callbacks). These callbacks would act as "getters" that
> the module could use to fetch internal data in a safe, forward-compatible
> way.
>
> For example, to extract the CPU capacity from a struct rq (which is opaque to
> the module), the module could call a getter function via the callback struct.
> These functions would reside inside the kernel, and could leverage internal
> knowledge, including inlined helpers and static data.
>
> Here's an example of the proposed callback structure:
>
> struct sched_tp_callbacks {
>         /* Fetches the util_est from a cfs_rq. */
>         unsigned int (*cfs_rq_util_est)(struct cfs_rq *cfs_rq);
>
>         /* Fetches the util_est from a sched_entity. */
>         unsigned int (*se_util_est)(struct sched_entity *se);
>
>         /* Fetches the current CPU capacity from an rq. */
>         unsigned long (*rq_cpu_current_capacity)(struct rq *rq);
> };
>
> The idea is simple: given a base type (e.g. rq, cfs_rq, sched_entity), the
> module calls a getter function that returns the data it needs. These getters
> encapsulate internal kernel logic and remove the need for the module to
> replicate or guess how to access scheduler internals.
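>
> On the module side, a probe would then look roughly like this (sketch only;
> it assumes the module can see the sched_tp_callbacks layout, e.g. via BTF
> or a shared header, and the probe name is again made up):
>
> static void probe_pelt_cfs(void *data, struct cfs_rq *cfs_rq,
>                            struct sched_tp_callbacks *cb)
> {
>         unsigned int util_est = cb->cfs_rq_util_est(cfs_rq);
>
>         /* Emit the module's own trace event carrying util_est. */
> }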
>
> Since these additions would be part of tracepoints used for
> testing/debugging, they are not considered stable ABI and can evolve as the
> kernel changes. It would be up to the module to adapt to changes in available
> hooks, types, or fields — something we already do today using BTF for
> disappearing types (e.g. struct util_est becoming a raw integer).
>
> While this approach would require some extra code in the kernel to define the
> callback struct and register the functions, we believe it would significantly
> improve the testability and maintainability of tooling like LISA. It could
> also be extended to support scheduler debugging scenarios beyond energy-aware
> scheduling.
>
> Our current testing pipeline already makes heavy use of LISA [1], which
> automates test execution and data analysis. It also integrates with rt-app
> [2] to generate configurable workloads.
>
> The attached proof-of-concept patch adds three such callback functions as a
> demonstration. We’ve tested this against a modified version of our module
> that uses the callbacks to fetch scheduler internals.
>
> Thoughts?
>
> [1] https://tooling.sites.arm.com/lisa/latest/
> [2] https://github.com/scheduler-tools/rt-app
>
> Signed-off-by: Luis Machado <luis.machado@....com>
> ---
> include/trace/events/sched.h | 27 +++++++++++++++++----------
> kernel/sched/core.c | 27 +++++++++++++++++++++++++++
> kernel/sched/fair.c | 16 ++++++++--------
> kernel/sched/pelt.c | 6 +++---
> kernel/sched/sched.h | 11 +++++++++++
> 5 files changed, 66 insertions(+), 21 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 8994e97d86c..0687f4f62d9 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -767,6 +767,13 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
> TP_printk("cpu=%d", __entry->cpu)
> );
>
> +/* This struct is declared here so the tracepoints below can pass
> + * this type as a parameter.
> + * This is only used for testing and debugging, so tracepoint probes can
> + * use the callbacks to fetch the data they need.
> + */
> +struct sched_tp_callbacks;
> +
> /*
> * Following tracepoints are not exported in tracefs and provide hooking
> * mechanisms only for testing and debugging purposes.
> @@ -774,8 +781,8 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
> * Postfixed with _tp to make them easily identifiable in the code.
> */
> DECLARE_TRACE(pelt_cfs_tp,
> - TP_PROTO(struct cfs_rq *cfs_rq),
> - TP_ARGS(cfs_rq));
> + TP_PROTO(struct cfs_rq *cfs_rq, struct sched_tp_callbacks *sched_tp_callbacks),
> + TP_ARGS(cfs_rq, sched_tp_callbacks));
>
> DECLARE_TRACE(pelt_rt_tp,
> TP_PROTO(struct rq *rq),
> @@ -794,24 +801,24 @@ DECLARE_TRACE(pelt_irq_tp,
> TP_ARGS(rq));
>
> DECLARE_TRACE(pelt_se_tp,
> - TP_PROTO(struct sched_entity *se),
> - TP_ARGS(se));
> + TP_PROTO(struct sched_entity *se, struct sched_tp_callbacks *sched_tp_callbacks),
> + TP_ARGS(se, sched_tp_callbacks));
>
> DECLARE_TRACE(sched_cpu_capacity_tp,
> - TP_PROTO(struct rq *rq),
> - TP_ARGS(rq));
> + TP_PROTO(struct rq *rq, struct sched_tp_callbacks *sched_tp_callbacks),
> + TP_ARGS(rq, sched_tp_callbacks));
>
> DECLARE_TRACE(sched_overutilized_tp,
> TP_PROTO(struct root_domain *rd, bool overutilized),
> TP_ARGS(rd, overutilized));
>
> DECLARE_TRACE(sched_util_est_cfs_tp,
> - TP_PROTO(struct cfs_rq *cfs_rq),
> - TP_ARGS(cfs_rq));
> + TP_PROTO(struct cfs_rq *cfs_rq, struct sched_tp_callbacks *sched_tp_callbacks),
> + TP_ARGS(cfs_rq, sched_tp_callbacks));
>
> DECLARE_TRACE(sched_util_est_se_tp,
> - TP_PROTO(struct sched_entity *se),
> - TP_ARGS(se));
> + TP_PROTO(struct sched_entity *se, struct sched_tp_callbacks *sched_tp_callbacks),
> + TP_ARGS(se, sched_tp_callbacks));
>
> DECLARE_TRACE(sched_update_nr_running_tp,
> TP_PROTO(struct rq *rq, int change),
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c81cf642dba..a0ee7534aaa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -99,6 +99,27 @@
> EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu);
> EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask);
>
> +static unsigned int cfs_rq_util_est(struct cfs_rq *cfs_rq)
> +{
> +        return cfs_rq ? cfs_rq->avg.util_est : 0;
> +}
> +
> +static unsigned int se_util_est(struct sched_entity *se)
> +{
> +
> +        return se ? se->avg.util_est & ~UTIL_AVG_UNCHANGED : 0;
> +}
> +
> +static unsigned long rq_cpu_current_capacity(struct rq *rq)
> +{
> +        if (rq == NULL)
> +                return 0;
> +
> +        unsigned long capacity_orig = per_cpu(cpu_scale, rq->cpu);
> +        unsigned long scale_freq = per_cpu(arch_freq_scale, rq->cpu);
> +        return cap_scale(capacity_orig, scale_freq);
> +}
> +
> /*
> * Export tracepoints that act as a bare tracehook (ie: have no trace event
> * associated with them) to allow external modules to probe them.
> @@ -8524,11 +8545,17 @@ LIST_HEAD(task_groups);
> static struct kmem_cache *task_group_cache __ro_after_init;
> #endif
>
> +struct sched_tp_callbacks sched_tp_callbacks;
> +
> void __init sched_init(void)
> {
> unsigned long ptr = 0;
> int i;
>
> +        sched_tp_callbacks.cfs_rq_util_est = cfs_rq_util_est;
> +        sched_tp_callbacks.se_util_est = se_util_est;
> +        sched_tp_callbacks.rq_cpu_current_capacity = rq_cpu_current_capacity;
> +
> /* Make sure the linker didn't screw up */
> #ifdef CONFIG_SMP
> BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e58..f115842669b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4433,8 +4433,8 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
> update_tg_cfs_load(cfs_rq, se, gcfs_rq);
>
> - trace_pelt_cfs_tp(cfs_rq);
> - trace_pelt_se_tp(se);
> + trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
> + trace_pelt_se_tp(se, &sched_tp_callbacks);
>
> return 1;
> }
> @@ -4698,7 +4698,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> cfs_rq_util_change(cfs_rq, 0);
>
> - trace_pelt_cfs_tp(cfs_rq);
> + trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
> }
>
> /**
> @@ -4728,7 +4728,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> cfs_rq_util_change(cfs_rq, 0);
>
> - trace_pelt_cfs_tp(cfs_rq);
> + trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
> }
>
> /*
> @@ -4865,7 +4865,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> enqueued += _task_util_est(p);
> WRITE_ONCE(cfs_rq->avg.util_est, enqueued);
>
> - trace_sched_util_est_cfs_tp(cfs_rq);
> + trace_sched_util_est_cfs_tp(cfs_rq, &sched_tp_callbacks);
> }
>
> static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> @@ -4881,7 +4881,7 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
> WRITE_ONCE(cfs_rq->avg.util_est, enqueued);
>
> - trace_sched_util_est_cfs_tp(cfs_rq);
> + trace_sched_util_est_cfs_tp(cfs_rq, &sched_tp_callbacks);
> }
>
> #define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> @@ -4970,7 +4970,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> ewma |= UTIL_AVG_UNCHANGED;
> WRITE_ONCE(p->se.avg.util_est, ewma);
>
> - trace_sched_util_est_se_tp(&p->se);
> + trace_sched_util_est_se_tp(&p->se, &sched_tp_callbacks);
> }
>
> static inline unsigned long get_actual_cpu_capacity(int cpu)
> @@ -10009,7 +10009,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> capacity = 1;
>
> cpu_rq(cpu)->cpu_capacity = capacity;
> - trace_sched_cpu_capacity_tp(cpu_rq(cpu));
> + trace_sched_cpu_capacity_tp(cpu_rq(cpu), &sched_tp_callbacks);
>
> sdg->sgc->capacity = capacity;
> sdg->sgc->min_capacity = capacity;
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 7a8534a2def..7ca37abf46b 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -296,7 +296,7 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
> {
> if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
> ___update_load_avg(&se->avg, se_weight(se));
> - trace_pelt_se_tp(se);
> + trace_pelt_se_tp(se, &sched_tp_callbacks);
> return 1;
> }
>
> @@ -310,7 +310,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
>
> ___update_load_avg(&se->avg, se_weight(se));
> cfs_se_util_change(&se->avg);
> - trace_pelt_se_tp(se);
> + trace_pelt_se_tp(se, &sched_tp_callbacks);
> return 1;
> }
>
> @@ -325,7 +325,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
> cfs_rq->curr != NULL)) {
>
> ___update_load_avg(&cfs_rq->avg, 1);
> - trace_pelt_cfs_tp(cfs_rq);
> + trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
> return 1;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 47972f34ea7..5b0b8bb460c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -182,6 +182,17 @@ extern struct list_head asym_cap_list;
> */
> #define RUNTIME_INF ((u64)~0ULL)
>
> +struct sched_tp_callbacks {
> +        /* Fetches the util_est from a cfs_rq. */
> +        unsigned int (*cfs_rq_util_est)(struct cfs_rq *cfs_rq);
> +        /* Fetches the util_est from a sched_entity. */
> +        unsigned int (*se_util_est)(struct sched_entity *se);
> +        /* Fetches the current cpu capacity out of a rq. */
> +        unsigned long (*rq_cpu_current_capacity)(struct rq *rq);
> +};
> +
> +extern struct sched_tp_callbacks sched_tp_callbacks;
> +
> static inline int idle_policy(int policy)
> {
> return policy == SCHED_IDLE;
>
> base-commit: 328802738e1cd091d04076317f3c2174125c5916
And I see the kernel test robot reported a build failure on i386 that I need to look at first:
cpu_scale and arch_freq_scale are not defined there.