linux-kernel - Re: [PATCH 4/4] [x86] perf: fix accidentally ack'ing a second event on intel perf counter

Message-ID: <AANLkTikOaCL8FqQuUQsYPxm19WZOdarp8AMAugN0mnqQ@mail.gmail.com>
Date:	Thu, 2 Sep 2010 10:13:19 +0200
From:	Stephane Eranian <eranian@...gle.com>
To:	Robert Richter <robert.richter@....com>
Cc:	Don Zickus <dzickus@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"mingo@...e.hu" <mingo@...e.hu>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH 4/4] [x86] perf: fix accidentally ack'ing a second event
 on intel perf counter

Robert,

Do you have the test program you used to reproduce this?
I believe the NHM hack does not solve the problem; it
only makes it less likely to appear.

I suspect the real issue is that the GLOBAL_STATUS
bitmask cannot be trusted. I'd like to verify this.

Has the problem appeared only on Nehalem, or also on
Westmere?


On Wed, Sep 1, 2010 at 4:57 PM, Robert Richter <robert.richter@....com> wrote:
> On 01.09.10 09:04:45, Stephane Eranian wrote:
>> Don,
>>
>> Found your patch on LKML (I am not on it).
>>
>> In your changelog you said:
>>
>> > During testing of a patch to stop having the perf subsystem swallow nmis,
>> > it was uncovered that Nehalem boxes were randomly getting unknown nmis
>> > when using the perf tool.
>> >
>> > Moving the ack'ing of the PMI closer to when we get the status allows
>> > the hardware to properly re-set the PMU bit signaling another PMI was
>> > triggered during the processing of the first PMI.  This allows the new
>> > logic for dealing with the shortcomings of multiple PMIs to handle the
>> > extra NMI by 'eat'ing it later.
>>
>> > Now one can wonder why are we getting a second PMI when we disable all
>> > the PMUs in the beginning of the NMI handler to prevent such a case, for
>> > that I do not know.  But I know the fix below helps deal with this quirk.
>> >
>>
>> I am assuming you're talking about back-to-back NMIs here, not nested NMIs.
>> I don't quite understand the scenario here. Is it the case that you handled 1
>> overflow, and then right as you return from the interrupt, you get a second
>> PMI with an ovfl_status=0 ?
>>
>> What events did you measure? Which counters did you use?
>> Did you have HT turned on?
>
> It is related to this thread:
>
>  http://lkml.org/lkml/2010/8/25/124
>
> Not acking the status immediately triggered an nmi, but the status was
> 0. Acking after reading and before processing the counters results in
> a non-zero status and thus, no empty nmi.
>
> -Robert
>
>>
>> > Tested on multiple Nehalems where the problem was occurring.  With the
>> > patch, the code now loops a second time to handle the second PMI (whereas
>> > before it was not).
>>
>
> --
> Advanced Micro Devices, Inc.
> Operating System Research Center
>
>
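
The ordering Robert describes above (ack the status right after reading
it, before processing the counters) can be illustrated with a toy model.
This is a sketch, not the kernel's handler: struct mock_pmu,
process_counters() and handle_pmi() are hypothetical names, and the
assumed hardware behaviour (an overflow that arrives while its status bit
is still set leaves no visible trace once that bit is acked) is only one
reading of this thread. It has not been checked against real Nehalem
hardware.

```c
#include <stdint.h>

/* Toy stand-in for the PMU state; NOT a real kernel structure. */
struct mock_pmu {
    uint64_t status;     /* models the overflow status bitmask */
    int late_overflows;  /* overflows that arrive during processing */
};

/* Models counter processing during which counter 0 overflows again. */
static void process_counters(struct mock_pmu *pmu)
{
    if (pmu->late_overflows > 0) {
        pmu->late_overflows--;
        /* Assumption: setting a bit that is still set changes nothing. */
        pmu->status |= 1;
    }
}

/*
 * Returns the number of loop passes that saw a non-zero status.
 * ack_early != 0: ack right after reading the status.
 * ack_early == 0: ack only after the counters were processed.
 */
static int handle_pmi(struct mock_pmu *pmu, int ack_early)
{
    int passes = 0;
    uint64_t status = pmu->status;

    while (status) {
        passes++;
        if (ack_early)
            pmu->status &= ~status;  /* bit is clear before processing */
        process_counters(pmu);
        if (!ack_early)
            pmu->status &= ~status;  /* also wipes the late overflow */
        status = pmu->status;        /* re-read to decide whether to loop */
    }
    return passes;
}
```

Starting from status = 1 and late_overflows = 1, handle_pmi(&pmu, 0)
makes one pass and leaves the late overflow unseen, while
handle_pmi(&pmu, 1) makes two passes. In this model that corresponds to
the "loops a second time" observation in the changelog. It says nothing
about whether the status bitmask can be trusted on real hardware.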