Message-ID: <Z9tHjRYnVfbQCPui@google.com>
Date: Wed, 19 Mar 2025 15:39:09 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Daniel Dao <dqminh@...udflare.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Jiri Olsa <jolsa@...nel.org>, Alexei Starovoitov <ast@...nel.org>,
	Andrii Nakryiko <andrii@...nel.org>, Song Liu <song@...nel.org>,
	kernel-team <kernel-team@...udflare.com>,
	Jesper Dangaard Brouer <hawk@...nel.org>,
	linux-perf-users@...r.kernel.org,
	linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: Scheduling latency outliers observed with perf sched record

Hello,

On Fri, Mar 14, 2025 at 11:59:23AM +0000, Daniel Dao wrote:
> Hi,
> 
> At Cloudflare, with both kernel 6.6.82 and 6.12.17, we frequently observe
> scheduling latency outliers with `perf sched record`. For example:
> 
> -------------------------------------------------------------------------------------------------------------------------------------------
>  Task                     |  Runtime ms  |  Count  |  Avg delay ms    |  Max delay ms    |  Max delay start            |  Max delay end
> -------------------------------------------------------------------------------------------------------------------------------------------
>  Pingora HTTP Pr:3133982  |     0.080 ms |       1 | avg:  881.614 ms | max:  881.614 ms | max start: 312881.567210 s  | max end: 312882.448824 s
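> 
> For reference, a summary like the one above can be produced with commands
> along these lines (the exact options are illustrative, not necessarily what
> we used):
> 
>   # record scheduler events system-wide for a short window, then summarize
>   $ perf sched record -a -- sleep 10
>   $ perf sched latency -s max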
> 
> The scheduling timeline of the outlier is as follows:
> 
>           time    cpu  task name                       wait time  sch delay   run time  state
>                        [tid/pid]                          (msec)     (msec)     (msec)
> --------------- ------  ------------------------------  ---------  ---------  ---------  -----
>   312881.567210 [0058]  Pingora HTTP Pr[3045283/3045241
>                    awakened: Pingora HTTP Pr[3133982/3045241
>   312882.448807 [0015]  Pingora HTTP Pr[3045278/3045241
>                    awakened: Pingora HTTP Pr[3133982/3045241
>   312882.448861 [0118]  Pingora HTTP Pr[3133982/3045241
>                    awakened: Pingora HTTP Pr[3045278/3045241
>   312882.448911 [0118]  Pingora HTTP Pr[3133982/3045241      0.000    881.614      0.087      S
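> 
> The per-event timeline above is a timehist-style view; something along these
> lines produces it from the recorded data (the exact invocation is
> illustrative):
> 
>   # per-event scheduling timeline, including wakeup events
>   $ perf sched timehist -w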
> 
> According to `perf sched record`, the task received several wakeups before it
> finally executed on CPU 118. The timeline on CPU 118 is as follows:
> 
>   312882.448824 [0118]  <idle>                               0.024      0.024      0.416      I
>   312882.448861 [0118]  Pingora HTTP Pr[3133982/3045241
>                    awakened: Pingora HTTP Pr[3045278/3045241
>   312882.448911 [0118]  Pingora HTTP Pr[3133982/3045241      0.000    881.614      0.087      S
> 
> This is very strange and seems impossible: the task had been woken up earlier,
> yet it was not able to run even though the CPU had idle time.
> 
> In production we usually do not use `perf sched record`; instead we use an
> eBPF program similar to
> https://github.com/bpftrace/bpftrace/blob/master/tools/runqlat.bt:
> 
> // TASK_RUNNING is not predefined by bpftrace, so define it explicitly
> #define TASK_RUNNING 0
> 
> tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new {
>   @qtime[args.pid] = nsecs;
> }
> 
> tracepoint:sched:sched_switch {
>   if (args.prev_state == TASK_RUNNING) {
>     @qtime[args.prev_pid] = nsecs;
>   }
> 
>   $ns = @qtime[args.next_pid];
>   if ($ns) {
>     @usecs = hist((nsecs - $ns) / 1000);
>   }
>   delete(@qtime, args.next_pid);
> }
> 
> We do not observe any outliers with this program. We figured out that the
> difference from `perf sched record` lies in how supposedly repeated wakeups
> are handled: the eBPF script took the last wakeup, while `perf sched latency`
> took the first one. Once we adjusted our bpftrace script to take the first
> wakeup, we observed the same outliers.
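> 
> The adjustment was essentially to keep the earliest pending wakeup timestamp
> instead of overwriting it on every wakeup; roughly like this (a simplified
> sketch, not the exact script we ran):
> 
> tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new {
>   // keep the first (earliest) pending wakeup; a missing key reads as 0
>   if (@qtime[args.pid] == 0) {
>     @qtime[args.pid] = nsecs;
>   }
> }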
> 
> Looking into this further, we adjusted the same bpftrace script to also print
> out `sum_exec_runtime`, to make sure that the process did not actually run,
> i.e. that we did not miss any sched_switch events. The adjusted script is as
> follows:
> 
> #define TASK_RUNNING 0
> #define THRESHOLD 100
> 
> rt:sched_wakeup, rt:sched_wakeup_new
> {
>   $task = (struct task_struct *)arg0;
>   $ns = @qtime[$task->pid];
>   $sum_exec_runtime = @sum_exec_runtime[$task->pid];
>   if ($ns > 0) {
>     $ms = ((nsecs - $ns) / 1000000);
>     if ($ms > THRESHOLD) {
>       printf("pid=%d t=%ld delay_ms=%d sum_exec_runtime=%ld prev_sum_exec_runtime=%ld cached_sum_exec_runtime=%ld\n",
>           $task->pid, nsecs, $ms,
>           $task->se.sum_exec_runtime, $task->se.prev_sum_exec_runtime, $sum_exec_runtime);
>     }
>     if ($task->se.sum_exec_runtime == $sum_exec_runtime) {
>       // repeated wakeup: the task has not run in between and we have not
>       // missed any sched_switch events, so do not count this wakeup
>       return;
>     }
>   }
>   @qtime[$task->pid] = nsecs;
>   @sum_exec_runtime[$task->pid] = $task->se.sum_exec_runtime;
> }
> 
> rt:sched_switch
> {
>   $prev = (struct task_struct *)arg1;
>   $next = (struct task_struct *)arg2;
>   $prev_state = arg3;
>   if ($prev_state == TASK_RUNNING) {
>     @qtime[$prev->pid] = nsecs;
>     @sum_exec_runtime[$prev->pid] = $prev->se.sum_exec_runtime;
>   }
> 
>   $ns = @qtime[$next->pid];
>   if ($ns) {
>     @usecs = hist((nsecs - $ns) / 1000);
>   }
>   delete(@qtime[$next->pid]);
>   delete(@sum_exec_runtime[$next->pid]);
> }
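> 
> For completeness, the script above runs with plain bpftrace (the file name
> below is just illustrative):
> 
>   $ sudo bpftrace repeated_wakeup.bt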
> 
> The script gives us the following data:
> 
>   pid=2197474 t=298353700427405 delay_ms=1157 sum_exec_runtime=1663205188 prev_sum_exec_runtime=1663167058 cached_sum_exec_runtime=1663167058
>   pid=2196692 t=298354187452899 delay_ms=1644 sum_exec_runtime=1630867065 prev_sum_exec_runtime=1630824755 cached_sum_exec_runtime=1630824755
>   pid=3290499 t=298354606215376 delay_ms=1482 sum_exec_runtime=26316767196 prev_sum_exec_runtime=26316654306 cached_sum_exec_runtime=26316654306
> 
> We can see that the script detected a repeated wakeup (i.e. the program did
> not observe a sched_switch for the task in between), yet the task's
> sum_exec_runtime has changed. For the first entry, for example,
> sum_exec_runtime - cached_sum_exec_runtime = 1663205188 - 1663167058 = 38130 ns,
> so the task actually ran for about 38 us between the two wakeups.
> 
> Am I understanding correctly that this means we must have missed a
> sched_switch tracepoint event? We did not see any miss counters for our BPF
> program reported by bpftool. The `perf sched record` command also did not
> report any missing events, and I think it must have missed some of these
> sched_switch events as well.
> 
> Is there anything else that we are missing here?

perf can miss some events when the ring buffer is full.  There should be
a warning about that.  You can also check whether the data has LOST
records by running this command:

  $ perf report --stats | grep LOST

Thanks,
Namhyung

