linux-kernel - [RFC] perf/core: Set event shadow time for inactive events too

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-Id: <20211011060556.1190936-1-namhyung@kernel.org>
Date:   Sun, 10 Oct 2021 23:05:56 -0700
From:   Namhyung Kim <namhyung@...nel.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Jiri Olsa <jolsa@...hat.com>,
        Mark Rutland <mark.rutland@....com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Stephane Eranian <eranian@...gle.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Ian Rogers <irogers@...gle.com>,
        Song Liu <songliubraving@...com>
Subject: [RFC] perf/core: Set event shadow time for inactive events too

Normally the shadow time is set in event_sched_in() and used to
calculate up-to-date enabled and running times without accessing
ctx->time in a NMI context.  It's fine for active events.

However, we can access inactive events as well in some cases (like
in BPF) and it caused errors due to the missing shadow time.  If
the event is not yet scheduled in for the first time, the shadow
time would be zero.  Then the time calculation in the
perf_event_read_local() can go wrong like below:

  perf_event_read_local(event, ...)
    now = event->shadow_ctx_time + perf_clock();
    __perf_update_times(event, now, ...)
      delta = now - event->tstamp;
      if (state >= INACTIVE)
        enabled += delta;
      if (state >= ACTIVE)
        running += delta;

In my understanding, the 'now' and event->tstamp are in ctx->time
which keeps actual time the given context is used.  While perf_clock()
returns a timestamp using a local clock, and event->shadow_ctx_time
should kinda convert it into ctx->time unit.

But since the shadow ctx time is not set, the delta calculation is
using two different units.  And it's added to enabled time only as
it's in an inactive state.  It'd be much bigger than ctx->time, so
it'd return much bigger enabled time.

The recent bperf infra can trigger this issue easily.  As this is
about a small window with multiplexing, we need a large number of
events and short duration like below:

  # perf stat -a -v --bpf-counters -e instructions,branches,branch-misses \
    -e cache-references,cache-misses,bus-cycles,ref-cycles,cycles sleep 0.1

  Control descriptor is not initialized
  instructions: 19616489 431324015 360374366
  branches: 3685346 417640114 344175443
  branch-misses: 75714 404089360 336145421
  cache-references: 438667 390474289 327444074
  cache-misses: 49279 349333164 272835067
  bus-cycles: 631887 283423953 165164214
  ref-cycles: 2578771111104847872 18446744069443110306 182116355
  cycles: 1785221016051271680 18446744071682768912 115821694

   Performance counter stats for 'system wide':

          19,616,489      instructions              #    0.00  insn per cycle           (83.55%)
           3,685,346      branches                                                      (82.41%)
              75,714      branch-misses             #    2.05% of all branches          (83.19%)
             438,667      cache-references                                              (83.86%)
              49,279      cache-misses              #   11.234 % of all cache refs      (78.10%)
             631,887      bus-cycles                                                    (58.27%)
  2,578,771,111,104,847,872      ref-cycles                                                    (0.00%)
  1,785,221,016,051,271,680      cycles                                                        (0.00%)

       0.010824702 seconds time elapsed

As you can see, it shows invalid values for the last two events.
The -v option shows that the enabled time is way bigger than the
running time.  So it scaled the counter values using the ratio
between the two and resulted in that.  This problem can get worse
if users want no-aggregation or cgroup aggregation with a small
interval.

Actually 18446744069443110306 is 0xffffffff01b345a2 so it seems to
have a negative enabled time.  In fact, bperf keeps values returned by
bpf_perf_event_read_value() which calls perf_event_read_local(), and
accumulates delta between two calls.  When event->shadow_ctx_time is
not set, it'd return invalid enabled time which is bigger than normal.
Later, the shadow time is set and the function starts to return a
valid time.  At the moment, the recent value is smaller than before so
the delta in the bperf can be negative.

I think we need to set the shadow time even the events are inactive so
that BPF programs (or other potential users) can see valid time values
anytime.

Cc: Song Liu <songliubraving@...com>
Signed-off-by: Namhyung Kim <namhyung@...nel.org>
---
 kernel/events/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1cb1f9b8392e..e28d410c67e3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3730,6 +3730,8 @@ static int merge_sched_in(struct perf_event *event, void *data)
 			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
 		}
 
+		perf_set_shadow_time(event, ctx);
+
 		*can_add_hw = 0;
 		ctx->rotate_necessary = 1;
 		perf_mux_hrtimer_restart(cpuctx);
-- 
2.33.0.882.g93a45727a2-goog