linux-kernel - Re: [PATCH] perf/x86: fix event counter update issue

Message-ID: <alpine.DEB.2.20.1702220941570.24020@macbook-air>
Date:   Wed, 22 Feb 2017 09:49:51 -0500 (EST)
From:   Vince Weaver <vincent.weaver@...ne.edu>
To:     Peter Zijlstra <peterz@...radead.org>
cc:     "Odzioba, Lukasz" <lukasz.odzioba@...el.com>,
        "Liang, Kan" <kan.liang@...el.com>,
        Stephane Eranian <eranian@...gle.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        "ak@...ux.intel.com" <ak@...ux.intel.com>
Subject: Re: [PATCH] perf/x86: fix event counter update issue

On Mon, 5 Dec 2016, Peter Zijlstra wrote:

> ---
> Subject: perf,x86: Fix full width counter, counter overflow
> Date: Tue, 29 Nov 2016 20:33:28 +0000
> 
> Lukasz reported that perf stat counters overflow is broken on KNL/SLM.
> 
> Both these parts have full_width_write set, and that does indeed have
> a problem. In order to deal with counter wrap, we must sample the
> counter at at least half the counter period (see also the sampling
> theorem) such that we can unambiguously reconstruct the count.

I know I'm a bit late to this issue, but I suddenly have PAPI users being 
very worried that their results are going to be wrong, and I hadn't heard 
about this until recently.

I'm trying to make a reproducer test and want to make sure I understand 
the issue.  (And I don't have any of the easy-to-trigger hardware either. 
And what is SLM?  Silvermont?  Are we really that short on Changelog space
that we can't spell out the abbreviations to make things clear for 
non-Intel employees?)

So from what I understand, the issue is if we have an architecture with 
full-width counters and we trigger an x86_perf_event_update() when bit
47 is set?

So if I have a test that runs in a loop for 2^48 retired instructions
(which takes ~12 hours on a recent machine) and then reads the results,
they might be wrong?
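
One way to probe the bit-47 reading above is a small model of the delta
computation. This is only a sketch under stated assumptions: a 48-bit counter,
and an update routine that shifts both raw values up to bit 63, subtracts, and
shifts back with sign extension. The function names and the 48-bit width are
illustrative, not quoted from the kernel source or from the patch.

```python
# Illustrative model only: names and the 48-bit width are assumptions,
# not taken from the kernel source.
CNTVAL_BITS = 48
MASK64 = (1 << 64) - 1


def to_signed64(x):
    """Reinterpret the low 64 bits of x as a signed 64-bit integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def signed_delta(prev_raw, new_raw, bits=CNTVAL_BITS):
    """Delta via shift up, 64-bit subtract, arithmetic shift back down.

    The result is sign-extended from bit (bits - 1), so any true increment
    of 2**(bits - 1) or more comes out negative.
    """
    shift = 64 - bits
    diff = ((new_raw << shift) - (prev_raw << shift)) & MASK64
    # Python's >> on a negative int is an arithmetic (sign-preserving) shift.
    return to_signed64(diff) >> shift


def unsigned_delta(prev_raw, new_raw, bits=CNTVAL_BITS):
    """Delta taken modulo 2**bits, with no sign extension."""
    return (new_raw - prev_raw) & ((1 << bits) - 1)


if __name__ == "__main__":
    start = 0
    for new in (1 << 46, (1 << 47) - 1, 1 << 47, (1 << 48) - 1):
        print(hex(new), signed_delta(start, new), unsigned_delta(start, new))
```

In this model, increments below 2^47 come out the same from both functions,
and a wrap through zero with a small true increment is also handled. Once the
true increment reaches 2^47 (bit 47 set in the difference), the sign-extended
version reports a negative delta. Whether that matches the failure seen on
real hardware is exactly the open question in this mail.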

It sounds like this can also be triggered by a sampling event with a 
really long period, but I couldn't puzzle out from the Changelog exactly 
how to reproduce this (or even how serious the issue is).

Vince