Message-ID: <40fe38e2c9594608849785c4d4d2a6d7@transsion.com>
Date: Thu, 20 Feb 2025 14:32:46 +0000
From: xianhe.zhou(周显赫) <xianhe.zhou@...nssion.com>
To: "mingo@...hat.com" <mingo@...hat.com>, "peterz@...radead.org"
<peterz@...radead.org>, "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>
CC: "dietmar.eggemann@....com" <dietmar.eggemann@....com>,
"rostedt@...dmis.org" <rostedt@...dmis.org>, "bsegall@...gle.com"
<bsegall@...gle.com>, "mgorman@...e.de" <mgorman@...e.de>,
"vschneid@...hat.com" <vschneid@...hat.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: [PATCH] sched: Fix incorrect runnable time calculation in
sched_stat_wait tracepoint
The sched_stat_wait tracepoint is meant to record how long a task was
runnable, i.e. how long it waited on the runqueue before being
scheduled. However, due to incorrect timestamp handling, two problems
arise:

1. The tracepoint may record two consecutive start timestamps or two
consecutive end timestamps instead of a properly paired start/end,
which leads to bogus runnable time values.

2. The measured interval may include temporary dequeue/enqueue cycles,
such as those triggered by changing a task's nice value while it is
running.

Make sure start and end timestamps are properly paired, and exclude
temporary dequeue/enqueue events from the measurement.
Testing:

The fix was verified on an Android 14 device. The trace output below
was captured before and after applying the patch.
Before the fix (incorrect values):
sched_stat_wait: comm=OverlayEngine_0 pid=1533 delay=352418315839 [ns]
sched_stat_wait: comm=OverlayEngine_0 pid=1533 delay=352409961840 [ns]
sched_stat_wait: comm=OverlayEngine_0 pid=1533 delay=352393317455 [ns]
sched_stat_wait: comm=binder:1148_3 pid=2102 delay=352380493377 [ns]
sched_stat_wait: comm=OverlayEngine_0 pid=1533 delay=352360033377 [ns]
These runnable time values, in the hundreds of seconds, are clearly
bogus.
After the fix (corrected values):
sched_stat_wait: comm=kworker/7:1 pid=13682 delay=63516923 [ns]
sched_stat_wait: comm=SuperA_R[136] pid=15074 delay=51324385 [ns]
sched_stat_wait: comm=SuperA_R[136] pid=15080 delay=50280539 [ns]
sched_stat_wait: comm=SuperA_R[136] pid=15071 delay=48995692 [ns]
sched_stat_wait: comm=SuperA_R[34] pid=14951 delay=41876308 [ns]
After the fix, the recorded runnable times are within expected ranges.
Fixes: 768d0c27226e ("sched: Add wait, sleep and iowait accounting tracepoints")
Signed-off-by: xianhe.zhou <xianhe.zhou@...nssion.com>
---
kernel/sched/fair.c | 13 +++++++------
kernel/sched/rt.c | 15 ++++++++++++---
kernel/sched/stats.c | 7 ++++++-
3 files changed, 25 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c0ef435a7aa..37a315b3906b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1331,11 +1331,12 @@ update_stats_enqueue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
 	if (!schedstat_enabled())
 		return;
 
-	/*
-	 * Are we enqueueing a waiting task? (for current tasks
-	 * a dequeue/enqueue event is a NOP)
+	/*
+	 * Are we enqueueing a waiting task, and is this not a
+	 * temporary enqueue? (for current tasks a
+	 * dequeue/enqueue event is a NOP)
 	 */
-	if (se != cfs_rq->curr)
+	if (se != cfs_rq->curr && !(flags & ENQUEUE_RESTORE))
 		update_stats_wait_start_fair(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -1351,9 +1352,9 @@ update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
 
 	/*
 	 * Mark the end of the wait period if dequeueing a
-	 * waiting task:
+	 * waiting task and it is not a temporary dequeue:
 	 */
-	if (se != cfs_rq->curr)
+	if (se != cfs_rq->curr && !(flags & DEQUEUE_SAVE))
 		update_stats_wait_end_fair(cfs_rq, se);
 
 	if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) {
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4b8e33c615b1..991251feeaf6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1310,9 +1310,17 @@ static inline void
 update_stats_enqueue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
 			int flags)
 {
+	struct task_struct *p = NULL;
+
 	if (!schedstat_enabled())
 		return;
 
+	if (rt_entity_is_task(rt_se))
+		p = rt_task_of(rt_se);
+
+	if (p != rq_of_rt_rq(rt_rq)->curr && !(flags & ENQUEUE_RESTORE))
+		update_stats_wait_start_rt(rt_rq, rt_se);
+
 	if (flags & ENQUEUE_WAKEUP)
 		update_stats_enqueue_sleeper_rt(rt_rq, rt_se);
 }
@@ -1348,6 +1356,10 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
 	if (rt_entity_is_task(rt_se))
 		p = rt_task_of(rt_se);
 
+	if (p != rq_of_rt_rq(rt_rq)->curr && !(flags & DEQUEUE_SAVE)) {
+		update_stats_wait_end_rt(rt_rq, rt_se);
+	}
+
 	if ((flags & DEQUEUE_SLEEP) && p) {
 		unsigned int state;
@@ -1474,9 +1486,6 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	if (flags & ENQUEUE_WAKEUP)
 		rt_se->timeout = 0;
 
-	check_schedstat_required();
-	update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);
-
 	enqueue_rt_entity(rt_se, flags);
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 4346fd81c31f..367f4bb1c759 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -20,8 +20,13 @@ void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
 void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
 			     struct sched_statistics *stats)
 {
-	u64 delta = rq_clock(rq) - schedstat_val(stats->wait_start);
+	u64 delta;
+	u64 wait_start = schedstat_val(stats->wait_start);
+	if (!wait_start)
+		return;
+
+	delta = rq_clock(rq) - wait_start;
 
 	if (p) {
 		if (task_on_rq_migrating(p)) {
 			/*
--
2.34.1