Message-ID: <AANLkTikOaCL8FqQuUQsYPxm19WZOdarp8AMAugN0mnqQ@mail.gmail.com>
Date: Thu, 2 Sep 2010 10:13:19 +0200
From: Stephane Eranian <eranian@...gle.com>
To: Robert Richter <robert.richter@....com>
Cc: Don Zickus <dzickus@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"mingo@...e.hu" <mingo@...e.hu>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH 4/4] [x86] perf: fix accidentally ack'ing a second event on intel perf counter

Robert,

Do you have the test program you used to test this?

I believe the NHM hack does not solve the problem, it
just makes it harder to trigger.

I suspect the real issue is that the GLOBAL_STATUS
bitmask cannot be trusted. I'd like to verify this.

Has the problem appeared only on Nehalem, or also on
Westmere?
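
To make the question concrete, here is a toy user-space
model of the two ack orderings described in the quoted
changelog below. It is only a sketch under assumptions:
it is not the kernel code, every name in it (toy_pmu,
toy_handle_pmi, ...) is made up, and it assumes the second
overflow hits the same counter while the first is still
being processed. It also assumes the status bitmask is
accurate, which is exactly what is in doubt above.

```c
#include <stdint.h>

/* Toy model only: all names here are hypothetical, none come from the kernel. */
struct toy_pmu {
    uint64_t status;   /* stands in for the overflow status bitmask */
    int pmi_pending;   /* a PMI was raised and has not been delivered yet */
};

/* A counter overflows: set its status bit and raise a PMI. */
static void toy_overflow(struct toy_pmu *pmu, int counter)
{
    pmu->status |= (uint64_t)1 << counter;
    pmu->pmi_pending = 1;
}

/* Count set bits, i.e. how many counters this pass will process. */
static int toy_count_bits(uint64_t mask)
{
    int n = 0;
    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/*
 * One handler invocation. If overflow_during is nonzero, counter 0
 * overflows again while the first pass is still processing.
 * Returns the number of overflows handled in this invocation.
 */
static int toy_handle_pmi(struct toy_pmu *pmu, int ack_early, int overflow_during)
{
    int handled = 0;
    uint64_t status = pmu->status;

    pmu->pmi_pending = 0;               /* this PMI is now being serviced */
    while (status) {
        if (ack_early)
            pmu->status &= ~status;     /* ack right after reading */
        handled += toy_count_bits(status);
        if (overflow_during) {
            toy_overflow(pmu, 0);       /* second overflow, same counter */
            overflow_during = 0;
        }
        if (!ack_early)
            pmu->status &= ~status;     /* late ack also wipes the new bit */
        status = pmu->status;           /* re-read and loop if work remains */
    }
    return handled;
}

/*
 * Run the scenario: one overflow, then a second one during handling.
 * Stores what the first and the follow-up invocation each handled
 * (-1 for the follow-up if no further PMI was pending).
 */
static void toy_scenario(int ack_early, int *first, int *second)
{
    struct toy_pmu pmu = { 0, 0 };

    toy_overflow(&pmu, 0);
    *first = toy_handle_pmi(&pmu, ack_early, 1);
    *second = pmu.pmi_pending ? toy_handle_pmi(&pmu, ack_early, 0) : -1;
}
```

In this model, toy_scenario(0, ...) (late ack) handles one
overflow and then takes a follow-up PMI with an empty
status. toy_scenario(1, ...) (early ack) loops a second
time and handles both overflows in the first invocation,
so the follow-up PMI, which still arrives, can be
recognized as expected and eaten.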
On Wed, Sep 1, 2010 at 4:57 PM, Robert Richter <robert.richter@....com> wrote:
> On 01.09.10 09:04:45, Stephane Eranian wrote:
>> Don,
>>
>> Found your patch on LKML (I am not on it).
>>
>> In your changelog you said:
>>
>> > During testing of a patch to stop having the perf subsystem swallow nmis,
>> > it was uncovered that Nehalem boxes were randomly getting unknown nmis
>> > when using the perf tool.
>> >
>> > Moving the ack'ing of the PMI closer to when we get the status allows
>> > the hardware to properly re-set the PMU bit signaling another PMI was
>> > triggered during the processing of the first PMI. This allows the new
>> > logic for dealing with the shortcomings of multiple PMIs to handle the
>> > extra NMI by 'eat'ing it later.
>>
>> > Now one can wonder why are we getting a second PMI when we disable all
>> > the PMUs in the beginning of the NMI handler to prevent such a case, for
>> > that I do not know. But I know the fix below helps deal with this quirk.
>> >
>>
>> I am assuming you're talking about back-to-back NMIs here, not nested NMIs.
>> I don't quite understand the scenario here. Is it the case that you handled 1
>> overflow, and then right as you return from the interrupt, you get a second
>> PMI with an ovfl_status=0?
>>
>> What events did you measure? Which counters did you use?
>> Did you have HT turned on?
>
> It is related to this thread:
>
> http://lkml.org/lkml/2010/8/25/124
>
> Not acking the status immediately triggered an nmi, but the status was
> 0. Acking after reading and before processing the counters results in
> a non-zero status and thus, no empty nmi.
>
> -Robert
>
>>
>> > Tested on multiple Nehalems where the problem was occurring. With the
>> > patch, the code now loops a second time to handle the second PMI (whereas
>> > before it was not).
>>
>
> --
> Advanced Micro Devices, Inc.
> Operating System Research Center
>
>