linux-kernel - Re: [PATCH 4/4] [x86] perf: fix accidentally ack'ing a second event on intel perf counter

Date:	Fri, 3 Sep 2010 13:02:49 +0200
From:	Stephane Eranian <eranian@...gle.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Don Zickus <dzickus@...hat.com>,
	Robert Richter <robert.richter@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"mingo@...e.hu" <mingo@...e.hu>
Subject: Re: [PATCH 4/4] [x86] perf: fix accidentally ack'ing a second event
 on intel perf counter

On Fri, Sep 3, 2010 at 10:33 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Thu, 2010-09-02 at 16:39 +0200, Stephane Eranian wrote:
>> I managed to reproduce on core i7 860 (without patch4).
>> Looking at the code again, I am dubious you ever execute
>> the retry goto. If the PMU is disabled and you've just
>> cleared the OVF_STAT, then I don't see where the new
>> overflows would come from. But that's a separate problem.
>>
>> One thing I did is to compare status obtained via OVFL_STATUS
>> with one that I build manually by inspecting each individual
>> counter. The two returned bitmasks should always be identical
>> (with PEBS disabled).  When I got the spurious NMI, it did not
>> trip my status validation. So the OVFL_STATUS is valid.
>>
>> I found something else that looked fishy. I am experimenting
>> with it. I will report back.
>
> One thing we still need to do is on init detect if the BIOS is using one
> of the PMCs and simply disable all of perf and print a nice big message
> to the user to request a new BIOS from their vendor.
>
Given the way perf_events operates, that is your only choice at this point.

But I am sure neither my system nor yours is subject to this particular issue,
yet there are still some unexplained errors with OVF_STATUS.

Here is an example of what I gathered on a Westmere:

This is coming into the interrupt handler:
- status   = overflow status coming from GLOBAL_OVF_STATUS
- status2 = inspection of the counters
- act = cpuc->active_mask[0]

If the two status values don't match, I dump the state of the active events,
including the counter values (val).

[  822.813808] CPU2 irqin status=0x6 status2=0x4 act=0x7
[  822.813818] CPU2 cfg=0x13003c idx=0 sel=53003c val=ffffa833f298
[  822.813821] CPU2 cfg=0x12003c idx=1 sel=52003c val=fffffe130229
[  822.813823] CPU2 cfg=0x11003c idx=2 sel=51003c val=5e9

Here only counter 2 has overflowed, yet the handler will also process
counter 1, which is wrong.
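
The comparison above can be sketched in a few lines of C. The debug code
itself is not shown in this message, so this is only an illustration under
stated assumptions: the function name status_from_counters() is hypothetical,
the counters are assumed to be 48 bits wide, and a counter is treated as
overflowed when its top bit is clear (counters are assumed to be programmed
with a negative period, so the top bit stays set until they wrap).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 48-bit wide general-purpose counters. */
#define CNTVAL_BITS 48

/*
 * Hypothetical helper: rebuild an overflow bitmask ("status2") by looking
 * at each active counter value instead of reading the global status MSR.
 * A counter counts up from a negative start value, so a clear top bit
 * means it has wrapped.
 */
static inline uint64_t status_from_counters(const uint64_t *val, int n,
                                            uint64_t active)
{
	uint64_t status = 0;
	int i;

	assert(n >= 0 && n <= 64);

	for (i = 0; i < n; i++) {
		if (!(active & (1ULL << i)))
			continue;	/* counter not in use, skip it */
		if (!(val[i] & (1ULL << (CNTVAL_BITS - 1))))
			status |= 1ULL << i;	/* top bit clear: wrapped */
	}
	return status;
}
```

With the three val fields from the dump and act=0x7, this yields 0x4, and
status & ~status2 = 0x6 & ~0x4 = 0x2, i.e. bit 1 is set in the hardware
status although counter 1 has not wrapped.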

The other thing I noticed is that in intel_pmu_disable_event(), the event
being stopped has sometimes already overflowed. It looks like OVF_STATUS is
stale. Maybe OVF_STATUS is not cleared properly somewhere, possibly when
an event gets disabled.

I have a busy system, with the NMI watchdog running (0x13003c) where I do:

perf record -a -C 1 -e cycles:k -ecycles:u -F 10 -- sleep 10
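
For reference, the cfg/sel values in the dump can be decoded with the
IA32_PERFEVTSELx bit layout from the Intel SDM. The sketch below is only an
illustration; struct evtsel and decode_evtsel() are hypothetical names, not
kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* IA32_PERFEVTSELx fields used here (layout per the Intel SDM). */
#define EVTSEL_EVENT_MASK	0xffULL		/* bits 0-7: event select */
#define EVTSEL_UMASK_SHIFT	8		/* bits 8-15: unit mask */
#define EVTSEL_USR		(1ULL << 16)	/* count in user mode */
#define EVTSEL_OS		(1ULL << 17)	/* count in kernel mode */
#define EVTSEL_INT		(1ULL << 20)	/* interrupt on overflow */
#define EVTSEL_EN		(1ULL << 22)	/* counter enabled */

static_assert(EVTSEL_EN == 0x400000ULL, "EN is bit 22");

/* Hypothetical decoded view of one event select value. */
struct evtsel {
	unsigned int event;
	unsigned int umask;
	int usr, os, intr, en;
};

static inline struct evtsel decode_evtsel(uint64_t v)
{
	struct evtsel e;

	e.event = (unsigned int)(v & EVTSEL_EVENT_MASK);
	e.umask = (unsigned int)((v >> EVTSEL_UMASK_SHIFT) & 0xff);
	e.usr   = !!(v & EVTSEL_USR);
	e.os    = !!(v & EVTSEL_OS);
	e.intr  = !!(v & EVTSEL_INT);
	e.en    = !!(v & EVTSEL_EN);
	return e;
}
```

Decoded this way, 0x13003c is event 0x3c (unhalted core cycles) with
USR+OS+INT, which matches the NMI watchdog; 0x12003c is the OS-only variant
(cycles:k) and 0x11003c the USR-only variant (cycles:u). The sel values in
the dump differ from cfg only by the EN bit (0x400000).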