Message-ID: <AANLkTimyk1PvCBQzqcmcqL7HetXso=_akKvrUM0+hOoD@mail.gmail.com>
Date: Fri, 3 Sep 2010 13:02:49 +0200
From: Stephane Eranian <eranian@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Don Zickus <dzickus@...hat.com>,
Robert Richter <robert.richter@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"mingo@...e.hu" <mingo@...e.hu>
Subject: Re: [PATCH 4/4] [x86] perf: fix accidentally ack'ing a second event
on intel perf counter
On Fri, Sep 3, 2010 at 10:33 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Thu, 2010-09-02 at 16:39 +0200, Stephane Eranian wrote:
>> I managed to reproduce on core i7 860 (without patch4).
>> Looking at the code again, I am dubious you ever execute
>> the retry goto. If the PMU is disabled and you've just
>> cleared the OVF_STAT, then I don't see where the new
>> overflows would come from. But that's a separate problem.
>>
>> One thing I did is to compare status obtained via OVFL_STATUS
>> with one that I build manually by inspecting each individual
>> counter. The two returned bitmasks should always be identical
>> (with PEBS disabled). When I got the spurious NMI, it did not
>> trip my status validation. So the OVFL_STATUS is valid.
>>
>> I found something else that looked fishy. I am experimenting
>> with it. I will report back.
>
> One thing we still need to do is on init detect if the BIOS is using one
> of the PMCs and simply disable all of perf and print a nice big message
> to the user to request a new BIOS from their vendor.
>
Given the way perf_events operates, that is your only choice at this point.
But I am sure neither my system nor yours is subject to this particular issue,
yet there are still some unexplained errors with OVF_STATUS.
Here is an example of what I gathered on a Westmere:
This is coming into the interrupt handler:
- status = overflow status coming from GLOBAL_OVF_STATUS
- status2 = inspection of the counters
- act = cpuc->active_mask[0]
When the two status values don't match, I dump the state of the active events,
including the counter values (val).
[ 822.813808] CPU2 irqin status=0x6 status2=0x4 act=0x7
[ 822.813818] CPU2 cfg=0x13003c idx=0 sel=53003c val=ffffa833f298
[ 822.813821] CPU2 cfg=0x12003c idx=1 sel=52003c val=fffffe130229
[ 822.813823] CPU2 cfg=0x11003c idx=2 sel=51003c val=5e9
Here only counter 2 has overflowed (status2=0x4), yet status=0x6 also flags
counter 1, so the handler will process counter 1 as well, which is wrong.
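
For reference, here is a minimal user-space sketch of how a status2-style mask could be rebuilt from the raw counter values in the dump above. It is not the debugging code actually used. The helper names are hypothetical, and it assumes 48-bit counters that are programmed with -period and count upward, so a counter whose top bit is clear has wrapped.

```c
#include <stdint.h>

/* Assumption: general-purpose counters are 48 bits wide on this CPU. */
#define CNTR_WIDTH 48

/*
 * Hypothetical helper: rebuild an overflow bitmask from raw counter values.
 * A sampling counter starts at -period and counts upward, so bit 47 stays
 * set until the counter wraps. A clear top bit is taken to mean "overflowed".
 */
static uint64_t status_from_counters(const uint64_t *val, int nr,
                                     uint64_t active)
{
    uint64_t mask = 0;

    for (int idx = 0; idx < nr; idx++) {
        if (!(active & (1ULL << idx)))
            continue;               /* skip counters that are not in use */
        if (!(val[idx] & (1ULL << (CNTR_WIDTH - 1))))
            mask |= 1ULL << idx;    /* top bit clear: counter has wrapped */
    }
    return mask;
}

/* Bits set in the hardware status that no wrapped counter accounts for. */
static uint64_t spurious_bits(uint64_t status, uint64_t status2)
{
    return status & ~status2;
}
```

Fed the three val fields from the dump with act=0x7, status_from_counters() yields 0x4, and spurious_bits(0x6, 0x4) leaves 0x2, i.e. the bit for counter 1.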
The other thing I noticed is that in intel_pmu_disable_event(), the event
being stopped has sometimes already overflowed. It looks like OVF_STATUS is
stale. Maybe OVF_STATUS is not cleared properly somewhere, possibly when
an event gets disabled.
I have a busy system with the NMI watchdog running (0x13003c), on which I run:
perf record -a -C 1 -e cycles:k -ecycles:u -F 10 -- sleep 10
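
To relate the sel values in the dump to these events, the following sketch decodes the fields that matter here. The bit positions follow the architectural IA32_PERFEVTSELx layout; the struct and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural IA32_PERFEVTSELx bits used in this thread. */
#define EVTSEL_USR (1ULL << 16)   /* count in user mode (ring 3) */
#define EVTSEL_OS  (1ULL << 17)   /* count in kernel mode (ring 0) */
#define EVTSEL_INT (1ULL << 20)   /* raise an interrupt on overflow */
#define EVTSEL_EN  (1ULL << 22)   /* counter enable */

/* Hypothetical container for the decoded fields. */
struct evtsel {
    uint8_t event;   /* bits 0-7: event select */
    uint8_t umask;   /* bits 8-15: unit mask */
    bool usr, os, intr, en;
};

static struct evtsel decode_evtsel(uint64_t sel)
{
    struct evtsel d = {
        .event = (uint8_t)(sel & 0xff),
        .umask = (uint8_t)((sel >> 8) & 0xff),
        .usr   = (sel & EVTSEL_USR) != 0,
        .os    = (sel & EVTSEL_OS) != 0,
        .intr  = (sel & EVTSEL_INT) != 0,
        .en    = (sel & EVTSEL_EN) != 0,
    };
    return d;
}
```

With this decoding, all three entries use event 0x3c (unhalted core cycles): sel=53003c counts in both user and kernel mode (the NMI watchdog), sel=52003c is kernel only (cycles:k), and sel=51003c is user only (cycles:u).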