Message-ID: <CA+wXwBRKB-1pAHby6FB_jiXspLTHv+mLyVzYj7TjkuT_Y9D4yQ@mail.gmail.com>
Date: Fri, 14 Mar 2025 11:59:23 +0000
From: Daniel Dao <dqminh@...udflare.com>
To: Peter Zijlstra <peterz@...radead.org>, Arnaldo Carvalho de Melo <acme@...nel.org>, 
	Namhyung Kim <namhyung@...nel.org>, Jiri Olsa <jolsa@...nel.org>, 
	Alexei Starovoitov <ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>, Song Liu <song@...nel.org>, 
	kernel-team <kernel-team@...udflare.com>, Jesper Dangaard Brouer <hawk@...nel.org>, 
	linux-perf-users@...r.kernel.org, linux-kernel <linux-kernel@...r.kernel.org>
Subject: Scheduling latency outliers observed with perf sched record

Hi,

At Cloudflare, on both kernels 6.6.82 and 6.12.17, we frequently
observe scheduling latency outliers with `perf sched record`. For example:

-------------------------------------------------------------------------------------------------------------------------------------------
  Task                    |   Runtime ms  |  Count   | Avg delay ms    | Max delay ms    | Max delay start            | Max delay end            |
-------------------------------------------------------------------------------------------------------------------------------------------
  Pingora HTTP Pr:3133982 |      0.080 ms |        1 | avg: 881.614 ms | max: 881.614 ms | max start: 312881.567210 s | max end: 312882.448824 s

The scheduling timeline of the outlier is as follows:

           time    cpu  task name                       wait time  sch delay   run time  state
                        [tid/pid]                           (msec)     (msec)     (msec)
 --------------- ------  ------------------------------  ---------  ---------  ---------  -----
   312881.567210 [0058]  Pingora HTTP Pr[3045283/3045241
                    awakened: Pingora HTTP Pr[3133982/3045241
   312882.448807 [0015]  Pingora HTTP Pr[3045278/3045241
                    awakened: Pingora HTTP Pr[3133982/3045241
   312882.448861 [0118]  Pingora HTTP Pr[3133982/3045241
                    awakened: Pingora HTTP Pr[3045278/3045241
   312882.448911 [0118]  Pingora HTTP Pr[3133982/3045241      0.000    881.614      0.087      S

According to `perf sched record`, the task received several wakeups
before it finally executed on CPU 118. The timeline on CPU 118 is as follows:

   312882.448824 [0118]  <idle>                               0.024      0.024      0.416      I
   312882.448861 [0118]  Pingora HTTP Pr[3133982/3045241
                    awakened: Pingora HTTP Pr[3045278/3045241
   312882.448911 [0118]  Pingora HTTP Pr[3133982/3045241      0.000    881.614      0.087      S

This is very strange and seems impossible: the task had been woken up
much earlier but was not able to run, even though the CPU had idle time.

In production, we usually do not use `perf sched record`, but instead an
eBPF program similar to
https://github.com/bpftrace/bpftrace/blob/master/tools/runqlat.bt:

// TASK_RUNNING is 0 in the kernel; define it here in case it cannot be
// resolved via BTF or kernel headers.
#define TASK_RUNNING 0

// Record the enqueue timestamp when a task is woken up.
tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new {
  @qtime[args.pid] = nsecs;
}

tracepoint:sched:sched_switch {
  // A task switched out while still runnable goes back on the
  // runqueue, so its queued time starts again here.
  if (args.prev_state == TASK_RUNNING) {
    @qtime[args.prev_pid] = nsecs;
  }

  // Histogram of how long the incoming task waited to get on-CPU.
  $ns = @qtime[args.next_pid];
  if ($ns) {
    @usecs = hist((nsecs - $ns) / 1000);
  }
  delete(@qtime, args.next_pid);
}

We do not observe any outliers with this program. We figured out that
the difference from `perf sched record` lies in how supposedly repeated
wakeups are handled: the eBPF script takes the last wakeup, while
`perf sched latency` takes the first wakeup. Once we adjusted our
bpftrace script to take the first wakeup, we observed the same outliers.
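
For reference, the adjustment amounts to not overwriting a pending
enqueue timestamp (a minimal sketch against the script above, not the
exact code we ran):

tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new {
  // Take the FIRST wakeup: only record a timestamp if none is pending.
  if (@qtime[args.pid] == 0) {
    @qtime[args.pid] = nsecs;
  }
}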

Looking into this further, we adjusted the same bpftrace script to
also print out `sum_exec_runtime`, to make sure that the process didn't
actually run, i.e. that we did not miss any sched_switch events. The
adjusted script is as follows:

#define TASK_RUNNING 0
#define THRESHOLD 100

rt:sched_wakeup, rt:sched_wakeup_new
{
  $task = (struct task_struct *)arg0;
  $ns = @qtime[$task->pid];
  $sum_exec_runtime = @sum_exec_runtime[$task->pid];
  if ($ns > 0) {
    $ms = (nsecs - $ns) / 1000000;
    if ($ms > THRESHOLD) {
      printf("pid=%d t=%ld delay_ms=%d sum_exec_runtime=%ld prev_sum_exec_runtime=%ld cached_sum_exec_runtime=%ld\n",
          $task->pid, nsecs, $ms,
          $task->se.sum_exec_runtime, $task->se.prev_sum_exec_runtime,
          $sum_exec_runtime);
    }
    if ($task->se.sum_exec_runtime == $sum_exec_runtime) {
      // Repeated wakeup: the task has not run since its last recorded
      // wakeup and we have not missed any sched_switch events, so do
      // not restart the queued time.
      return;
    }
  }
  @qtime[$task->pid] = nsecs;
  @sum_exec_runtime[$task->pid] = $task->se.sum_exec_runtime;
}

rt:sched_switch
{
  $prev = (struct task_struct *)arg1;
  $next = (struct task_struct *)arg2;
  $prev_state = arg3;
  if ($prev_state == TASK_RUNNING) {
    @qtime[$prev->pid] = nsecs;
    @sum_exec_runtime[$prev->pid] = $prev->se.sum_exec_runtime;
  }

  $ns = @qtime[$next->pid];
  if ($ns) {
    @usecs = hist((nsecs - $ns) / 1000);
  }
  delete(@qtime[$next->pid]);
  delete(@sum_exec_runtime[$next->pid]);
}

The script gives us the following data:

  pid=2197474 t=298353700427405 delay_ms=1157 sum_exec_runtime=1663205188 prev_sum_exec_runtime=1663167058 cached_sum_exec_runtime=1663167058
  pid=2196692 t=298354187452899 delay_ms=1644 sum_exec_runtime=1630867065 prev_sum_exec_runtime=1630824755 cached_sum_exec_runtime=1630824755
  pid=3290499 t=298354606215376 delay_ms=1482 sum_exec_runtime=26316767196 prev_sum_exec_runtime=26316654306 cached_sum_exec_runtime=26316654306

We can see that the script detected a repeated wakeup (i.e. the program
did not observe a sched_switch for the task), yet the task's
sum_exec_runtime had changed, which should only happen if the task
actually ran in between.
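
As a further sanity check, one could log every context switch that
involves a single affected task and see whether the sched_switch events
really arrive (a sketch, not what we ran; $1 is a bpftrace positional
parameter naming the pid of interest):

rt:sched_switch
{
  $prev = (struct task_struct *)arg1;
  $next = (struct task_struct *)arg2;
  // Print every switch in or out of the pid passed on the command line.
  if ($prev->pid == $1 || $next->pid == $1) {
    printf("t=%ld cpu=%d prev=%d next=%d\n", nsecs, cpu, $prev->pid, $next->pid);
  }
}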

Am I understanding correctly that this means we must have missed a
sched_switch tracepoint event? We didn't see any miss counters for our
BPF program reported by bpftool. The `perf sched record` command also
didn't report any lost events, and I think it must have missed some of
these sched_switch events as well.
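
One cheap cross-check we can think of (a sketch, under the assumption
that the tracepoint and the raw tracepoint fire from the same call
site) is to count both side by side and compare the totals against the
kernel's own context-switch counter (the `ctxt` line in /proc/stat)
over the same window:

rt:sched_switch               { @raw_switch = count(); }
tracepoint:sched:sched_switch { @tp_switch = count(); }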

Is there anything else that we are missing here?

Daniel.
