Message-ID: <CABPqkBT=64LyAo9_RgYBdFqDh0yR+aE11N_JCqX4wpi6iNLMOA@mail.gmail.com>
Date:   Mon, 28 Nov 2016 12:18:39 -0800
From:   Stephane Eranian <eranian@...gle.com>
To:     "Liang, Kan" <kan.liang@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        "ak@...ux.intel.com" <ak@...ux.intel.com>,
        "Odzioba, Lukasz" <lukasz.odzioba@...el.com>
Subject: Re: [PATCH] perf/x86: fix event counter update issue

On Mon, Nov 28, 2016 at 11:59 AM, Liang, Kan <kan.liang@...el.com> wrote:
>
>
>> >
>> > Here, all the possible failure cases are listed.
>> > Terms:
>> >     - new: current PMU counter value, read via rdpmcl().
>> >     - prev: previous counter value, stored in &hwc->prev_count.
>> >     - in PMI/not in PMI: whether the event update happens in the PMI handler.
>> > Current code to calculate delta:
>> >     delta = (new << shift) - (prev << shift);
>> >     delta >>= shift;
>> >
>> > Case A: Not in PMI.  new > prev. But delta is negative.
>> >    That's the failure case of Test 2.
>> >    delta is s64; new and prev are u64. If new is big enough, bit 63
>> >    (the sign flag) of the result may still be set after the left
>> >    shifts and the subtraction.
>> >    Since delta is s64, the right shift is an arithmetic shift, which
>> >    copies the sign flag into the vacated upper bit positions of delta,
>> >    making it negative.
>> >    It can be fixed by adding the max count value.
>> >
>> >    Here is the real data for test2 on KNL.
>> >    new = aea96e1927
>> >    prev = ffffff0000000001
>>
>>
>> How can new be so large here?
>> I mean between prev and new, the counter wrapped around. And it took
>> that many occurrences (of cycles) to handle the overflow?
>
> There is no overflow in this case.
> The counter is 40 bits wide. The highest value it can count without
> overflowing is 0xfffffffffe.
>
>>
>> >
>> >    delta = aea96e1927000000 - 1000000 = aea96e1926000000
>> >    aea96e1926000000 >> 24 = ffffffaea96e1926   <<  negative delta
>> >
>> > Case B: In PMI. new > prev. delta is positive.
>> >    That's the failure case of Test 3.
>> >    The PMI is triggered by a counter overflow, but there is latency
>> >    between the overflow and the PMI handler running, so new holds only
>> >    a small value. The current calculation loses the max count value.
>> >
>> > Case C: In PMI. new < prev. delta is negative.
>> >    The PMU counter may start from a big value, e.g. when the fixed
>> >    period is small.
>> >    It can be fixed by adding the max count value.
>> >
>> I am not sure I understand why PMI should matter here. What matters is
>> prev vs. current and the wrap-around.
>> Please explain.
>> Thanks.
>
> Right, PMI shouldn't matter here.
> We should add the max count value if delta is negative, no matter whether it's in a PMI.
>
That sounds right, but then why do you have that boolean (bool pmi) in
your patch?

> To fix this issue, I once had a list which included all the possible cases.
> "non-PMI, new < prev, delta is negative" is one of them, but it rarely happens.
> So I removed it, but forgot to update the description of Case C accordingly.
>
> Thanks,
> Kan
>
>>
>> >
>> > There is still a case where delta could be wrong.
>> > It is when the event update happens in the gap between the counter
>> > overflow and the PMI: not in the PMI, new > prev, and delta is
>> > positive, yet the current calculation may lose the max count value.
>> > But this case rarely happens, so it is not handled here.
>> >
>> > Reported-by: Lukasz Odzioba <lukasz.odzioba@...el.com>
>> > Signed-off-by: Kan Liang <kan.liang@...el.com>
>> > Tested-by: Lukasz Odzioba <lukasz.odzioba@...el.com>
>> > ---
>> >  arch/x86/events/core.c       | 23 +++++++++++++++++++----
>> >  arch/x86/events/intel/core.c |  4 ++--
>> >  arch/x86/events/intel/p4.c   |  2 +-
>> >  arch/x86/events/perf_event.h |  2 +-
>> >  4 files changed, 23 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index
>> > 6c3b0ef..c351ac4 100644
>> > --- a/arch/x86/events/core.c
>> > +++ b/arch/x86/events/core.c
>> > @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
>> >   * Can only be executed on the CPU where the event is active.
>> >   * Returns the delta events processed.
>> >   */
>> > -u64 x86_perf_event_update(struct perf_event *event)
>> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
>> >  {
>> >         struct hw_perf_event *hwc = &event->hw;
>> >         int shift = 64 - x86_pmu.cntval_bits; @@ -100,6 +100,21 @@ u64
>> > x86_perf_event_update(struct perf_event *event)
>> >         delta = (new_raw_count << shift) - (prev_raw_count << shift);
>> >         delta >>= shift;
>> >
>> > +       /*
>> > +        * Correct delta for special cases caused by late PMI handling.
>> > +        * Case1: The PMU counter may start from a big value.
>> > +        *        In the PMI, new < prev, so delta is negative.
>> > +        * Case2: new is big enough to set the sign flag, so delta
>> > +        *        becomes negative after the arithmetic right shift.
>> > +        * Case3: In the PMI, new > prev.
>> > +        *        new - prev loses the max count value.
>> > +        *
>> > +        * An event update may also happen in the gap between overflow
>> > +        * and the PMI; that rarely happens, so it isn't handled here.
>> > +        */
>> > +       if (((delta > 0) && pmi) || (delta < 0))
>> > +               delta += x86_pmu.cntval_mask + 1;
>> > +
>> >         local64_add(delta, &event->count);
>> >         local64_sub(delta, &hwc->period_left);
>> >
>> > @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event,
>> int flags)
>> >                  * Drain the remaining delta count out of a event
>> >                  * that we are disabling:
>> >                  */
>> > -               x86_perf_event_update(event);
>> > +               x86_perf_event_update(event, (flags == 0));
>> >                 hwc->state |= PERF_HES_UPTODATE;
>> >         }
>> >  }
>> > @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
>> >
>> >                 event = cpuc->events[idx];
>> >
>> > -               val = x86_perf_event_update(event);
>> > +               val = x86_perf_event_update(event, true);
>> >                 if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
>> >                         continue;
>> >
>> > @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
>> >
>> >  static inline void x86_pmu_read(struct perf_event *event)  {
>> > -       x86_perf_event_update(event);
>> > +       x86_perf_event_update(event, false);
>> >  }
>> >
>> >  /*
>> > diff --git a/arch/x86/events/intel/core.c
>> > b/arch/x86/events/intel/core.c index a74a2db..69d65e6 100644
>> > --- a/arch/x86/events/intel/core.c
>> > +++ b/arch/x86/events/intel/core.c
>> > @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
>> >         for (i = 0; i < 4; i++) {
>> >                 event = cpuc->events[i];
>> >                 if (event)
>> > -                       x86_perf_event_update(event);
>> > +                       x86_perf_event_update(event, false);
>> >         }
>> >
>> >         for (i = 0; i < 4; i++) {
>> > @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct
>> perf_event *event)
>> >   */
>> >  int intel_pmu_save_and_restart(struct perf_event *event)  {
>> > -       x86_perf_event_update(event);
>> > +       x86_perf_event_update(event, true);
>> >         /*
>> >          * For a checkpointed counter always reset back to 0.  This
>> >          * avoids a situation where the counter overflows, aborts the
>> > diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
>> > index eb05335..8117de8 100644
>> > --- a/arch/x86/events/intel/p4.c
>> > +++ b/arch/x86/events/intel/p4.c
>> > @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs
>> *regs)
>> >                 /* it might be unflagged overflow */
>> >                 overflow = p4_pmu_clear_cccr_ovf(hwc);
>> >
>> > -               val = x86_perf_event_update(event);
>> > +               val = x86_perf_event_update(event, true);
>> >                 if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
>> >                         continue;
>> >
>> > diff --git a/arch/x86/events/perf_event.h
>> > b/arch/x86/events/perf_event.h index c6b25ac..09c9db2 100644
>> > --- a/arch/x86/events/perf_event.h
>> > +++ b/arch/x86/events/perf_event.h
>> > @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
>> >                                 [PERF_COUNT_HW_CACHE_OP_MAX]
>> >                                 [PERF_COUNT_HW_CACHE_RESULT_MAX];
>> >
>> > -u64 x86_perf_event_update(struct perf_event *event);
>> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
>> >
>> >  static inline unsigned int x86_pmu_config_addr(int index)  {
>> > --
>> > 2.4.3
>> >
