Message-ID: <725629dc-b1a4-42d5-81c6-ae09dc5f9444@arm.com>
Date: Wed, 2 Apr 2025 16:22:11 +0100
From: Luis Machado <luis.machado@....com>
To: linux-kernel@...r.kernel.org, peterz@...radead.org,
dietmar.eggemann@....com
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com
Subject: Re: [RFC] sched: Improving scheduler debugging/tracing interfaces
On 4/2/25 13:26, Luis Machado wrote:
> This patch is primarily a proposal aimed at improving the observability of the
> scheduler, especially in the context of energy-aware scheduling, without
> introducing long-term maintenance burdens such as a stable userspace ABI. I’m
> seeking feedback from the community on whether this approach seems viable, or
> if there are suggestions for making it more robust or maintainable.
>
> Today, validating that a set of scheduler changes behaves sanely and doesn’t
> regress performance or energy metrics can be time-consuming. On the
> energy-aware side in particular, this often requires significant manual
> intervention to collect, post-process, and analyze data.
>
> Another challenge is the limited availability of platforms that can run a
> mainline kernel while still exposing the detailed data we need. While we do
> have some options, most devices running upstream kernels don’t provide as
> much — or as precise — information as we’d like. The most data-rich devices
> tend to be phones or Android-based systems, which typically run slightly
> older or patched kernels, adding yet another layer of complexity.
>
> As a result, going from reviewing a patch series on LKML to having a concrete
> good/bad/neutral result often involves several intermediate steps and tooling
> hurdles.
>
> Our current data collection relies heavily on existing kernel tracepoints and
> trace events. However, adding new trace events is increasingly discouraged,
> since these are often treated as part of a de facto userspace ABI — something
> we want to avoid maintaining long-term. So extending the trace events set isn’t
> a viable option.
>
> To work around this, we use a kernel module (LISA) that defines its own trace
> events based on existing scheduler tracepoints. This approach gives us
> flexibility in creating events without modifying the kernel’s core trace
> infrastructure or establishing any new userspace ABI.
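>
> As a rough sketch (simplified, with module boilerplate and error handling
> left out, and function names made up for illustration), a probe attached to
> one of the existing bare scheduler tracepoints looks something like this:
>
> static void probe_pelt_cfs(void *data, struct cfs_rq *cfs_rq)
> {
>         /* Emit the module's own trace event from here. */
> }
>
> static int __init sched_tp_module_init(void)
> {
>         return register_trace_pelt_cfs_tp(probe_pelt_cfs, NULL);
> }
>
> static void __exit sched_tp_module_exit(void)
> {
>         unregister_trace_pelt_cfs_tp(probe_pelt_cfs, NULL);
>         tracepoint_synchronize_unregister();
> }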
>
> For the past few years, scheduler tracepoint definitions have been exposed
> in include/trace/events/sched.h. Some of these are bare tracepoints that are
> not made available via tracefs and are documented as being for testing and
> debugging purposes — which aligns well with our use case.
>
> However, this approach has limitations. One issue is the visibility of
> tracepoint argument types. If a tracepoint uses a public type defined in a
> public header, we can dereference members directly to extract data. But if
> the type is internal or opaque — such as struct rq — we can’t access its
> contents, which prevents us from retrieving useful values like the CPU number.
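>
> For example, a probe attached to sched_cpu_capacity_tp receives a struct rq
> pointer, but from the module's point of view struct rq is only a forward
> declaration, so something as simple as the following (probe name made up)
> does not even build:
>
> static void probe_cpu_capacity(void *data, struct rq *rq)
> {
>         /* Build error: struct rq is an incomplete type here. */
>         int cpu = rq->cpu;
> }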
>
> One workaround is to duplicate the kernel’s internal struct definitions in
> the module, but that is fragile: it’s error-prone due to layout and alignment
> mismatches, and it requires constantly tracking kernel changes to stay in sync.
>
> A better approach, which we currently use, is to rely on BTF (BPF Type
> Format) to reconstruct type information. BTF allows us to access internal
> kernel types without having to maintain duplicate struct definitions. As long
> as BTF info is available, we can introspect data structures even if they’re
> not publicly defined.
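>
> Just to illustrate the kind of introspection BTF enables (this is a
> standalone userspace sketch using libbpf, not how the module itself
> consumes BTF), the layout of an internal type can be read straight out
> of the vmlinux BTF:
>
> #include <stdio.h>
> #include <string.h>
> #include <bpf/btf.h>
>
> int main(void)
> {
>         const struct btf_type *t;
>         const struct btf_member *m;
>         struct btf *btf;
>         __s32 id;
>         int i;
>
>         btf = btf__load_vmlinux_btf();
>         if (!btf)
>                 return 1;
>
>         id = btf__find_by_name_kind(btf, "rq", BTF_KIND_STRUCT);
>         if (id < 0)
>                 return 1;
>
>         t = btf__type_by_id(btf, id);
>         m = btf_members(t);
>
>         for (i = 0; i < btf_vlen(t); i++, m++) {
>                 const char *name = btf__name_by_offset(btf, m->name_off);
>
>                 /* Print the byte offset of rq->cpu, an internal field. */
>                 if (name && !strcmp(name, "cpu"))
>                         printf("rq.cpu at byte offset %u\n",
>                                btf_member_bit_offset(t, i) / 8);
>         }
>
>         btf__free(btf);
>         return 0;
> }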
>
> Using this, our module can define trace events and dereference internal types
> to extract data — but it’s not without friction:
>
> - Struct members are often nested deeply within BTF type trees, which can make
> it awkward to navigate and extract data.
>
> - BTF describes data types, but not semantics. For example, sched_avg.util_est
> appears to be a numeric value, but in reality it encodes a flag alongside the
> actual utilization value. The kernel uses the following helper to extract the
> actual data:
>
> static inline unsigned long _task_util_est(struct task_struct *p)
> {
>         return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> }
>
> There is no way to infer from BTF alone that this masking is needed. And even
> when such helpers exist in the kernel, they’re often inlined or unavailable
> to modules, so we’d have to reimplement them — reintroducing the same
> maintenance overhead.
>
> To address these challenges and reduce duplication, we propose adding an
> extra argument to certain scheduler tracepoints: a pointer to a struct of
> function pointers (callbacks). These callbacks would act as "getters" that
> the module could use to fetch internal data in a safe, forward-compatible
> way.
>
> For example, to extract the CPU capacity from a struct rq (which is opaque to
> the module), the module could call a getter function via the callback struct.
> These functions would reside inside the kernel, and could leverage internal
> knowledge, including inlined helpers and static data.
>
> Here's an example of the proposed callback structure:
>
> struct sched_tp_callbacks {
>         /* Fetches the util_est from a cfs_rq. */
>         unsigned int (*cfs_rq_util_est)(struct cfs_rq *cfs_rq);
>
>         /* Fetches the util_est from a sched_entity. */
>         unsigned int (*se_util_est)(struct sched_entity *se);
>
>         /* Fetches the current CPU capacity from an rq. */
>         unsigned long (*rq_cpu_current_capacity)(struct rq *rq);
> };
>
> The idea is simple: given a base type (e.g. rq, cfs_rq, sched_entity), the
> module calls a getter function that returns the data it needs. These getters
> encapsulate internal kernel logic and remove the need for the module to
> replicate or guess how to access scheduler internals.
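>
> On the module side, a probe would then look roughly like this (sketch only;
> it assumes the module can see the sched_tp_callbacks layout, e.g. via BTF
> or a shared header, and the probe name is again made up):
>
> static void probe_pelt_cfs(void *data, struct cfs_rq *cfs_rq,
>                            struct sched_tp_callbacks *cb)
> {
>         unsigned int util_est = cb->cfs_rq_util_est(cfs_rq);
>
>         /* Emit the module's own trace event carrying util_est. */
> }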
>
> Since these additions would be part of tracepoints used for
> testing/debugging, they are not considered stable ABI and can evolve as the
> kernel changes. It would be up to the module to adapt to changes in available
> hooks, types, or fields — something we already do today using BTF for
> disappearing types (e.g. struct util_est becoming a raw integer).
>
> While this approach would require some extra code in the kernel to define the
> callback struct and register the functions, we believe it would significantly
> improve the testability and maintainability of tooling like LISA. It could
> also be extended to support scheduler debugging scenarios beyond energy-aware
> scheduling.
>
> Our current testing pipeline already makes heavy use of LISA [1], which
> automates test execution and data analysis. It also integrates with rt-app
> [2] to generate configurable workloads.
>
> The attached proof-of-concept patch adds three such callback functions as a
> demonstration. We’ve tested this against a modified version of our module
> that uses the callbacks to fetch scheduler internals.
>
> Thoughts?
>
> [1] https://tooling.sites.arm.com/lisa/latest/
> [2] https://github.com/scheduler-tools/rt-app
>
> Signed-off-by: Luis Machado <luis.machado@....com>
> ---
> include/trace/events/sched.h | 27 +++++++++++++++++----------
> kernel/sched/core.c | 27 +++++++++++++++++++++++++++
> kernel/sched/fair.c | 16 ++++++++--------
> kernel/sched/pelt.c | 6 +++---
> kernel/sched/sched.h | 11 +++++++++++
> 5 files changed, 66 insertions(+), 21 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 8994e97d86c..0687f4f62d9 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -767,6 +767,13 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
> TP_printk("cpu=%d", __entry->cpu)
> );
>
> +/* This struct is declared here so the tracepoints below can pass
> + * this type as a parameter.
> + * This is only used for testing and debugging, so tracepoint probes can
> + * use the callbacks to fetch the data they need.
> + */
> +struct sched_tp_callbacks;
> +
> /*
> * Following tracepoints are not exported in tracefs and provide hooking
> * mechanisms only for testing and debugging purposes.
> @@ -774,8 +781,8 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
> * Postfixed with _tp to make them easily identifiable in the code.
> */
> DECLARE_TRACE(pelt_cfs_tp,
> - TP_PROTO(struct cfs_rq *cfs_rq),
> - TP_ARGS(cfs_rq));
> + TP_PROTO(struct cfs_rq *cfs_rq, struct sched_tp_callbacks *sched_tp_callbacks),
> + TP_ARGS(cfs_rq, sched_tp_callbacks));
>
> DECLARE_TRACE(pelt_rt_tp,
> TP_PROTO(struct rq *rq),
> @@ -794,24 +801,24 @@ DECLARE_TRACE(pelt_irq_tp,
> TP_ARGS(rq));
>
> DECLARE_TRACE(pelt_se_tp,
> - TP_PROTO(struct sched_entity *se),
> - TP_ARGS(se));
> + TP_PROTO(struct sched_entity *se, struct sched_tp_callbacks *sched_tp_callbacks),
> + TP_ARGS(se, sched_tp_callbacks));
>
> DECLARE_TRACE(sched_cpu_capacity_tp,
> - TP_PROTO(struct rq *rq),
> - TP_ARGS(rq));
> + TP_PROTO(struct rq *rq, struct sched_tp_callbacks *sched_tp_callbacks),
> + TP_ARGS(rq, sched_tp_callbacks));
>
> DECLARE_TRACE(sched_overutilized_tp,
> TP_PROTO(struct root_domain *rd, bool overutilized),
> TP_ARGS(rd, overutilized));
>
> DECLARE_TRACE(sched_util_est_cfs_tp,
> - TP_PROTO(struct cfs_rq *cfs_rq),
> - TP_ARGS(cfs_rq));
> + TP_PROTO(struct cfs_rq *cfs_rq, struct sched_tp_callbacks *sched_tp_callbacks),
> + TP_ARGS(cfs_rq, sched_tp_callbacks));
>
> DECLARE_TRACE(sched_util_est_se_tp,
> - TP_PROTO(struct sched_entity *se),
> - TP_ARGS(se));
> + TP_PROTO(struct sched_entity *se, struct sched_tp_callbacks *sched_tp_callbacks),
> + TP_ARGS(se, sched_tp_callbacks));
>
> DECLARE_TRACE(sched_update_nr_running_tp,
> TP_PROTO(struct rq *rq, int change),
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c81cf642dba..a0ee7534aaa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -99,6 +99,27 @@
> EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu);
> EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask);
>
> +static unsigned int cfs_rq_util_est(struct cfs_rq *cfs_rq)
> +{
> +        return cfs_rq ? cfs_rq->avg.util_est : 0;
> +}
> +
> +static unsigned int se_util_est(struct sched_entity *se)
> +{
> +
> +        return se ? se->avg.util_est & ~UTIL_AVG_UNCHANGED : 0;
> +}
> +
> +static unsigned long rq_cpu_current_capacity(struct rq *rq)
> +{
> +        if (rq == NULL)
> +                return 0;
> +
> +        unsigned long capacity_orig = per_cpu(cpu_scale, rq->cpu);
> +        unsigned long scale_freq = per_cpu(arch_freq_scale, rq->cpu);
> +        return cap_scale(capacity_orig, scale_freq);
> +}
> +
> /*
> * Export tracepoints that act as a bare tracehook (ie: have no trace event
> * associated with them) to allow external modules to probe them.
> @@ -8524,11 +8545,17 @@ LIST_HEAD(task_groups);
> static struct kmem_cache *task_group_cache __ro_after_init;
> #endif
>
> +struct sched_tp_callbacks sched_tp_callbacks;
> +
> void __init sched_init(void)
> {
> unsigned long ptr = 0;
> int i;
>
> +        sched_tp_callbacks.cfs_rq_util_est = cfs_rq_util_est;
> +        sched_tp_callbacks.se_util_est = se_util_est;
> +        sched_tp_callbacks.rq_cpu_current_capacity = rq_cpu_current_capacity;
> +
> /* Make sure the linker didn't screw up */
> #ifdef CONFIG_SMP
> BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e58..f115842669b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4433,8 +4433,8 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
> update_tg_cfs_load(cfs_rq, se, gcfs_rq);
>
> - trace_pelt_cfs_tp(cfs_rq);
> - trace_pelt_se_tp(se);
> + trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
> + trace_pelt_se_tp(se, &sched_tp_callbacks);
>
> return 1;
> }
> @@ -4698,7 +4698,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> cfs_rq_util_change(cfs_rq, 0);
>
> - trace_pelt_cfs_tp(cfs_rq);
> + trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
> }
>
> /**
> @@ -4728,7 +4728,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> cfs_rq_util_change(cfs_rq, 0);
>
> - trace_pelt_cfs_tp(cfs_rq);
> + trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
> }
>
> /*
> @@ -4865,7 +4865,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> enqueued += _task_util_est(p);
> WRITE_ONCE(cfs_rq->avg.util_est, enqueued);
>
> - trace_sched_util_est_cfs_tp(cfs_rq);
> + trace_sched_util_est_cfs_tp(cfs_rq, &sched_tp_callbacks);
> }
>
> static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> @@ -4881,7 +4881,7 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
> WRITE_ONCE(cfs_rq->avg.util_est, enqueued);
>
> - trace_sched_util_est_cfs_tp(cfs_rq);
> + trace_sched_util_est_cfs_tp(cfs_rq, &sched_tp_callbacks);
> }
>
> #define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> @@ -4970,7 +4970,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> ewma |= UTIL_AVG_UNCHANGED;
> WRITE_ONCE(p->se.avg.util_est, ewma);
>
> - trace_sched_util_est_se_tp(&p->se);
> + trace_sched_util_est_se_tp(&p->se, &sched_tp_callbacks);
> }
>
> static inline unsigned long get_actual_cpu_capacity(int cpu)
> @@ -10009,7 +10009,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> capacity = 1;
>
> cpu_rq(cpu)->cpu_capacity = capacity;
> - trace_sched_cpu_capacity_tp(cpu_rq(cpu));
> + trace_sched_cpu_capacity_tp(cpu_rq(cpu), &sched_tp_callbacks);
>
> sdg->sgc->capacity = capacity;
> sdg->sgc->min_capacity = capacity;
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 7a8534a2def..7ca37abf46b 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -296,7 +296,7 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
> {
> if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
> ___update_load_avg(&se->avg, se_weight(se));
> - trace_pelt_se_tp(se);
> + trace_pelt_se_tp(se, &sched_tp_callbacks);
> return 1;
> }
>
> @@ -310,7 +310,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
>
> ___update_load_avg(&se->avg, se_weight(se));
> cfs_se_util_change(&se->avg);
> - trace_pelt_se_tp(se);
> + trace_pelt_se_tp(se, &sched_tp_callbacks);
> return 1;
> }
>
> @@ -325,7 +325,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
> cfs_rq->curr != NULL)) {
>
> ___update_load_avg(&cfs_rq->avg, 1);
> - trace_pelt_cfs_tp(cfs_rq);
> + trace_pelt_cfs_tp(cfs_rq, &sched_tp_callbacks);
> return 1;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 47972f34ea7..5b0b8bb460c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -182,6 +182,17 @@ extern struct list_head asym_cap_list;
> */
> #define RUNTIME_INF ((u64)~0ULL)
>
> +struct sched_tp_callbacks {
> +        /* Fetches the util_est from a cfs_rq. */
> +        unsigned int (*cfs_rq_util_est)(struct cfs_rq *cfs_rq);
> +        /* Fetches the util_est from a sched_entity. */
> +        unsigned int (*se_util_est)(struct sched_entity *se);
> +        /* Fetches the current cpu capacity out of a rq. */
> +        unsigned long (*rq_cpu_current_capacity)(struct rq *rq);
> +};
> +
> +extern struct sched_tp_callbacks sched_tp_callbacks;
> +
> static inline int idle_policy(int policy)
> {
> return policy == SCHED_IDLE;
>
> base-commit: 328802738e1cd091d04076317f3c2174125c5916
And I see the kernel test robot reported a build failure on i386 that I need to look at first:
cpu_scale and arch_freq_scale are not defined there.