linux-kernel - Potential Issue in Tracing Ring Buffer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-Id: <20201124223917.795844-1-elavila@google.com>
Date:   Tue, 24 Nov 2020 22:39:17 +0000
From:   "J. Avila" <elavila@...gle.com>
To:     rostedt@...dmis.org, mingo@...hat.com
Cc:     gregkh@...uxfoundation.org, john.stultz@...aro.org,
        linux-kernel@...r.kernel.org, kernel-team@...roid.com
Subject: Potential Issue in Tracing Ring Buffer

Hello,

In the ftrace logs we've collected internally, we have found that there are
situations where time seems to go backwards; this breaks userspace tools which
expect time to always go forward in these logs. For example, in this snippet
from a db845c running a 5.10-rc4 kernel[1] (thanks for getting us the trace,
John!), we see:

      android.bg-568     [002] dN.1        26247: rcu_utilization: Start context switch
           <...>-589     [003] ....        26247: sys_enter: NR 73 (b400006f4898c520, 2, 0, 0, 0, 11)
 PowerManagerSer-578     [001] d..2        26248: sched_switch: prev_comm=PowerManagerSer prev_pid=578 prev_prio=116 prev_state=S ==> next_comm=Binder:381_2 next_pid=395 next_prio=120
      android.bg-568     [002] dN.1        26248: rcu_utilization: End context switch
           <...>-589     [003] d..1        26249: rcu_utilization: Start context switch
           <...>-589     [003] d..1        26250: rcu_utilization: End context switch
    Binder:381_2-395     [001] ....        26251: sys_exit: NR 98 = 0
      android.bg-568     [002] dN.2        26251: sched_stat_wait: comm=kworker/u16:3 pid=164 delay=0 [ns]
           <...>-589     [003] d..2        26252: sched_stat_runtime: comm=statsd.writer pid=589 runtime=611094 [ns] vruntime=24367153868 [ns]
      android.bg-568     [002] d..2        26254: sched_switch: prev_comm=android.bg prev_pid=568 prev_prio=130 prev_state=R+ ==> next_comm=kworker/u16:3 next_pid=164 next_prio=120
   kworker/u16:3-164     [002] ....        26256: workqueue_execute_start: work struct 00000000ab3fe95e: function ufshcd_clk_scaling_resume_work
    Binder:381_2-395     [001] ....        26257: sys_enter: NR 98 (b400006fb8984bf0, 89, 185c, 0, 0, ffffffff)
    Binder:381_2-395     [001] d..1        26258: rcu_utilization: Start context switch
   kworker/u16:3-164     [002] ....        26258: workqueue_execute_end: work struct 00000000ab3fe95e: function ufshcd_clk_scaling_resume_work
    Binder:381_2-395     [001] d..1        26260: rcu_utilization: End context switch
   kworker/u16:3-164     [002] d..1        26261: rcu_utilization: Start context switch
    Binder:381_2-395     [001] d..2        26262: sched_stat_runtime: comm=Binder:381_2 pid=395 runtime=100989 [ns] vruntime=40144094287 [ns]
   kworker/u16:3-164     [002] d..1        26263: rcu_utilization: End context switch
   kworker/u16:3-164     [002] d..2        26237: sched_stat_runtime: comm=kworker/u16:3 pid=164 runtime=71614 [ns] vruntime=32851170186 [ns]
   kworker/u16:3-164     [002] d..2        26240: sched_stat_wait: comm=android.bg pid=568 delay=99635 [ns]
   kworker/u16:3-164     [002] d..2        26241: sched_switch: prev_comm=kworker/u16:3 prev_pid=164 prev_prio=120 prev_state=I ==> next_comm=android.bg next_pid=568 next_prio=130

In this trace, we switched the trace clock to counter to rule out hardware
issues, but this has been seen with default settings as well. This is
consistently reproducible - we see it regularly when collecting any trace for
~60 seconds. This seems like it could be a problem in the upstream code, as this
kernel had no modifications made to the ftrace code. Is this a known problem in
upstream? Is there any additional information I could provide to verify this is
an upstream issue? Is anyone else observing this behavior, and if so, do they
happen to be working on a fix? If not, do you have any pointers on how we can
address this?

Thanks,

Avila

[1] Taken from https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline