Message-ID: <6a366f508a004b26a48d0328341adfa4@transsion.com>
Date: Mon, 17 Mar 2025 07:03:15 +0000
From: xianhe.zhou(周显赫) <xianhe.zhou@...nssion.com>
To: "vineethr@...ux.ibm.com" <vineethr@...ux.ibm.com>
CC: "mingo@...hat.com" <mingo@...hat.com>, "peterz@...radead.org"
<peterz@...radead.org>, "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
"incent.guittot@...aro.org" <incent.guittot@...aro.org>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"dietmar.eggemann@....com" <dietmar.eggemann@....com>, "rostedt@...dmis.org"
<rostedt@...dmis.org>, "bsegall@...gle.com" <bsegall@...gle.com>,
"mgorman@...e.de" <mgorman@...e.de>, "vschneid@...hat.com"
<vschneid@...hat.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RESEND] sched: Fix incorrect runnable time calculation in
sched_stat_wait tracepoint
On 12 Mar 2025 23:37:03 +0530, Madadi Vineeth Reddy wrote:
> As per my understanding, when a temporary dequeue occurs, the delay value
> printed should be smaller compared to when these temporary events are
> skipped, as the delay is then computed between the final dequeue and the
> original start time.
> I think I am missing something. Can you take an example and explain?
Thank you for your feedback. Let me clarify the root cause and the patch's
rationale:
1. Root Cause of Incorrect Runnable Time Statistics
The erroneous runnable time values (e.g., 352409961840 ns) stem from
improper timestamp pairing during runqueue (rq) migrations.
For example, when an RT task migrates from CPU0 to CPU6:
- A wait_start timestamp is recorded upon enqueueing to CPU0.
- No wait_end is logged when dequeuing from CPU0 (the runnable time
  accumulated on CPU0 should contribute to the task's total runnable time).
- A new wait_start is incorrectly recorded when enqueueing to CPU6.
This mismatch leaves a "dangling" start timestamp, so the final runnable
time calculation uses an invalid start point and produces wildly inflated
values. The sketch below illustrates the failure mode.
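To make this concrete, here is a minimal user-space sketch (illustrative
only; wait_end_buggy()/wait_end_fixed() are made-up names, not kernel
functions) of what happens when a wait_end is paired with a start that was
never recorded on the current rq:

#include <stdio.h>

typedef unsigned long long u64;

static u64 wait_start;	/* 0 means "no wait period in progress" */

/* Old behaviour: blindly pairs the end with whatever start is stored. */
static u64 wait_end_buggy(u64 now)
{
	return now - wait_start;
}

/* Patched behaviour: refuse to compute a delta without a matching start. */
static u64 wait_end_fixed(u64 now)
{
	if (!wait_start)
		return 0;
	return now - wait_start;
}

int main(void)
{
	/* The task migrated in; no wait_start was recorded on this rq. */
	u64 now = 352418315839ULL;	/* rq clock, roughly 352 s after boot */

	printf("buggy delta: %llu ns\n", wait_end_buggy(now)); /* raw clock */
	printf("fixed delta: %llu ns\n", wait_end_fixed(now)); /* 0, skipped */
	return 0;
}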
2. Temporary Enqueue/Dequeue Issues
The original logic also erroneously accounted for temporary enqueue/dequeue
operations as valid runnable time. For example, when modifying the nice
value of a CFS task while it is actively running:
- The task is temporarily dequeued (DEQUEUE_SAVE) and re-enqueued
  (ENQUEUE_RESTORE).
- The brief window between these operations was incorrectly treated as
  runnable time.
The patch explicitly skips timestamp updates for such transient operations
by checking the DEQUEUE_SAVE/ENQUEUE_RESTORE flags, ensuring that only
meaningful runnable periods are measured (see the condensed sketch below).
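In kernel terms the gating is just a flag check before touching the wait
statistics. Condensed from the CFS hunks of the patch below (simplified for
illustration, with invented wrapper names; the real functions also handle
schedstat_enabled(), ENQUEUE_WAKEUP and DEQUEUE_SLEEP):

/* Simplified sketch of the patched pattern, not the verbatim kernel code. */
static void stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se,
			  int flags)
{
	/*
	 * ENQUEUE_RESTORE: the task is returning from a transient dequeue
	 * (e.g. a nice-value change), so no new wait period begins.
	 */
	if (se != cfs_rq->curr && !(flags & ENQUEUE_RESTORE))
		update_stats_wait_start_fair(cfs_rq, se);
}

static void stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se,
			  int flags)
{
	/*
	 * DEQUEUE_SAVE: transient dequeue; the wait period is still
	 * logically running, so do not record an end timestamp.
	 */
	if (se != cfs_rq->curr && !(flags & DEQUEUE_SAVE))
		update_stats_wait_end_fair(cfs_rq, se);
}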
Summary of Fixes
- RT task migration: ensures proper timestamp pairing across rq migrations.
- Transient operations: excludes temporary enqueue/dequeue events
  (e.g., priority adjustments) from runnable time calculations.
The test results demonstrate that these changes restore valid runnable
time measurements (e.g., ~50 ms vs. the original ~352 s outliers).
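As a sanity check on the numbers: with a dangling start, the reported delay
degenerates to

	delta = rq_clock(rq) - wait_start
	      = 352418315839 - 0
	      = 352418315839 ns (~352.4 s)

i.e. the raw rq clock itself, which is exactly the magnitude of the
outliers in the trace.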
Thanks,
Xianhe Zhou
On 06/03/25 08:09, xianhe.zhou(周显赫) wrote:
> From b07df3ed3fd2186a698ef87ad366523ec7c7fccb Mon Sep 17 00:00:00 2001
> From: "xianhe.zhou" <xianhe.zhou@...nssion.com>
> Date: Thu, 20 Feb 2025 22:01:26 +0800
> Subject: [PATCH] sched: Fix incorrect runnable time calculation in
> sched_stat_wait tracepoint
>
> The sched_stat_wait tracepoint is intended to record the runnable
> time of a task. However, due to incorrect timestamp handling, two major
> issues arise:
>
> 1. The tracepoint may incorrectly record consecutive start timestamps
> or consecutive end timestamps, rather than properly pairing them.
> This leads to incorrect runnable time calculations.
> 2. The recorded time may include temporary dequeuing and enqueuing
> events, such as when modifying a task's nice value while it is running.
>
> This patch ensures that timestamps are properly paired and excludes
> temporary dequeuing events to prevent incorrect measurements.
>
> Testing:
> The fix was tested on an Android 14 device. The following are the
> trace results before and after applying the patch:
>
> Before the fix (incorrect values):
>
> sched_stat_wait: comm=OverlayEngine_0 pid=1533 delay=352418315839 [ns]
> sched_stat_wait: comm=OverlayEngine_0 pid=1533 delay=352409961840 [ns]
> sched_stat_wait: comm=OverlayEngine_0 pid=1533 delay=352393317455 [ns]
> sched_stat_wait: comm=binder:1148_3 pid=2102 delay=352380493377 [ns]
> sched_stat_wait: comm=OverlayEngine_0 pid=1533 delay=352360033377 [ns]
>
> Clearly, these runnable time values, in the hundreds of seconds,
> are incorrect.
>
> After the fix (corrected values):
>
> sched_stat_wait: comm=kworker/7:1 pid=13682 delay=63516923 [ns]
> sched_stat_wait: comm=SuperA_R[136] pid=15074 delay=51324385 [ns]
> sched_stat_wait: comm=SuperA_R[136] pid=15080 delay=50280539 [ns]
> sched_stat_wait: comm=SuperA_R[136] pid=15071 delay=48995692 [ns]
> sched_stat_wait: comm=SuperA_R[34] pid=14951 delay=41876308 [ns]
>
> After the fix, the recorded runnable times are within expected ranges.
As per my understanding, when a temporary dequeue occurs, the delay value
printed should be smaller compared to when these temporary events are
skipped, as the delay is then computed between the final dequeue and the
original start time.
I think I am missing something. Can you take an example and explain?
Thanks,
Madadi Vineeth Reddy
>
> Fixes: 768d0c27226e ("sched: Add wait, sleep and iowait accounting tracepoints")
> Signed-off-by: xianhe.zhou <xianhe.zhou@...nssion.com>
> ---
> kernel/sched/fair.c | 13 +++++++------
> kernel/sched/rt.c | 15 ++++++++++++---
> kernel/sched/stats.c | 7 ++++++-
> 3 files changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1c0ef435a7aa..37a315b3906b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1331,11 +1331,12 @@ update_stats_enqueue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
> if (!schedstat_enabled())
> return;
>
> - /*
> - * Are we enqueueing a waiting task? (for current tasks
> - * a dequeue/enqueue event is a NOP)
> + /*
> + * Are we enqueueing a waiting task that is not
> + * being temporarily re-enqueued? (for current tasks
> + * a dequeue/enqueue event is a NOP)
> */
> - if (se != cfs_rq->curr)
> + if (se != cfs_rq->curr && !(flags & ENQUEUE_RESTORE))
> update_stats_wait_start_fair(cfs_rq, se);
>
> if (flags & ENQUEUE_WAKEUP)
> @@ -1351,9 +1352,9 @@ update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
>
> /*
> * Mark the end of the wait period if dequeueing a
> - * waiting task:
> + * waiting task that is not being temporarily dequeued:
> */
> - if (se != cfs_rq->curr)
> + if (se != cfs_rq->curr && !(flags & DEQUEUE_SAVE))
> update_stats_wait_end_fair(cfs_rq, se);
>
> if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) {
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 4b8e33c615b1..991251feeaf6 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1310,9 +1310,17 @@ static inline void
> update_stats_enqueue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
> int flags)
> {
> + struct task_struct *p = NULL;
> +
> if (!schedstat_enabled())
> return;
>
> + if (rt_entity_is_task(rt_se))
> + p = rt_task_of(rt_se);
> +
> + if (p != rq_of_rt_rq(rt_rq)->curr && !(flags & ENQUEUE_RESTORE))
> + update_stats_wait_start_rt(rt_rq, rt_se);
> +
> if (flags & ENQUEUE_WAKEUP)
> update_stats_enqueue_sleeper_rt(rt_rq, rt_se);
> }
> @@ -1348,6 +1356,10 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
> if (rt_entity_is_task(rt_se))
> p = rt_task_of(rt_se);
>
> + if (p != rq_of_rt_rq(rt_rq)->curr && !(flags & DEQUEUE_SAVE)) {
> + update_stats_wait_end_rt(rt_rq, rt_se);
> + }
> +
> if ((flags & DEQUEUE_SLEEP) && p) {
> unsigned int state;
>
> @@ -1474,9 +1486,6 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> if (flags & ENQUEUE_WAKEUP)
> rt_se->timeout = 0;
>
> - check_schedstat_required();
> - update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);
> -
> enqueue_rt_entity(rt_se, flags);
>
> if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
> index 4346fd81c31f..367f4bb1c759 100644
> --- a/kernel/sched/stats.c
> +++ b/kernel/sched/stats.c
> @@ -20,8 +20,13 @@ void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
> void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
> struct sched_statistics *stats)
> {
> - u64 delta = rq_clock(rq) - schedstat_val(stats->wait_start);
> + u64 delta;
> + u64 wait_start = schedstat_val(stats->wait_start);
>
> + if (!wait_start)
> + return;
> +
> + delta = rq_clock(rq) - wait_start;
> if (p) {
> if (task_on_rq_migrating(p)) {
> /*