linux-kernel - Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)

Message-ID: <CAMsRxf+18qz_vOkEZ1a8D9Z7BywWZPNB=qEn0bHXMFg96sALTQ@mail.gmail.com>
Date:	Tue, 10 Sep 2013 07:34:41 -0700
From:	Stephane Eranian <eranian@...glemail.com>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Andi Kleen <andi@...stfloor.org>
Subject: Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re:
 [GIT PULL] perf changes for v3.12)

On Tue, Sep 10, 2013 at 7:29 AM, Ingo Molnar <mingo@...nel.org> wrote:
>
> * Stephane Eranian <eranian@...glemail.com> wrote:
>
>> On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <mingo@...nel.org> wrote:
>> >
>> > * Stephane Eranian <eranian@...glemail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> Ok, so I am able to reproduce the problem using a simpler
>> >> test case with a simple multithreaded program where
>> >> #threads >> #CPUs.
>> >
>> > Does it go away if you use 'perf record --all-cpus'?
>> >
>> Haven't tried that yet.
>>
>> But I verified the DS pointers:
>> init:
>> CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
>> intr=ffff8808262de0c0 max=ffff8808262defc0
>> crash:
>> CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
>> intr=ffff8808262de0c0 max=ffff8808262defc0
>>
>> Neither the base nor the max is modified.
>> The index simply goes beyond the threshold, but that's not a bug.
>> It is 12 records past the threshold of 1, so the total is 13 in my new
>> crash report.
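
For reference, the record counts quoted above follow from plain pointer
arithmetic on the DS fields. Here is a minimal sketch, assuming the record
size is intr - base from the init dump (0xc0 = 192 bytes, since the threshold
was set to one record). The struct and helper names are illustrative only,
not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative copy of the four DS PEBS fields printed in the dumps
 * (hypothetical struct, not the kernel's debug_store layout). */
struct pebs_ds_dump {
    uint64_t base;   /* start of the PEBS buffer */
    uint64_t index;  /* where the next record will be written */
    uint64_t intr;   /* interrupt threshold */
    uint64_t max;    /* end of the usable buffer */
};

/* Assumption: the threshold was one record at init, so intr - base
 * equals the size of a single PEBS record. */
static uint64_t pebs_record_size(const struct pebs_ds_dump *init)
{
    return init->intr - init->base;
}

/* Number of complete records between base and index. */
static uint64_t pebs_records_written(const struct pebs_ds_dump *ds,
                                     uint64_t rec_size)
{
    assert(rec_size != 0);
    return (ds->index - ds->base) / rec_size;
}

/* Number of records that fit between base and max. */
static uint64_t pebs_capacity(const struct pebs_ds_dump *ds,
                              uint64_t rec_size)
{
    assert(rec_size != 0);
    return (ds->max - ds->base) / rec_size;
}
```

With the crash values, (0x...e9c0 - 0x...e000) / 0xc0 gives 13 records, i.e.
12 past the one-record threshold, and the buffer holds 0xfc0 / 0xc0 = 21.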
>>
>> Two things to try:
>> - measure only one thread/core
>> - move the threshold a bit farther away (to get 2 or 3 entries)
>>
>> The threshold is where to generate the interrupt. It does not mean where
>> to stop PEBS recording. So it is possible that in HSW, we may get into a
>> situation where it takes time to get to the handler to stop the PMU. I
>> don't know how given we use NMI. Well, unless we were already servicing
>> an NMI at the time. But given that we stop the PMU almost immediately in
>> the handler, I don't see how that would be possible. The other oddity in
>> HSW is that we clear the NMI on entry to the handler and not at the end.
>> I never got a good explanation as to why that was necessary. So
>> maybe it is related...
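
To make the threshold point concrete, here is a toy model, not kernel code.
It assumes the interrupt threshold is just base + N * record size, that
records keep landing until the PMU is actually stopped, and that the hardware
never writes past max. Both helpers and their parameters are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Address to program as the interrupt threshold so the PMI is raised
 * once n_records records have been written (hypothetical helper). */
static uint64_t pebs_threshold_addr(uint64_t base, uint64_t rec_size,
                                    uint64_t n_records)
{
    return base + n_records * rec_size;
}

/* Toy model: the threshold only triggers the interrupt. Any records
 * written before the handler stops the PMU are added on top, capped
 * by the buffer capacity (assumption: no writes past max). */
static uint64_t pebs_records_seen(uint64_t threshold_records,
                                  uint64_t extra_before_stop,
                                  uint64_t capacity)
{
    uint64_t n = threshold_records + extra_before_stop;

    assert(threshold_records <= capacity);
    return n < capacity ? n : capacity;
}
```

Moving the threshold to 3 entries with the dump above would mean
intr = 0x...e000 + 3 * 0xc0 = 0x...e240. In this model the 13 records seen at
the crash correspond to threshold_records = 1 and extra_before_stop = 12.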
>
> Do you mean:
>
>         if (!x86_pmu.late_ack)
>                 apic_write(APIC_LVTPC, APIC_DM_NMI);
>
> AFAICS that means the opposite: that we clear the NMI late, i.e. shortly
> before return, after we've processed the PMU.
>
Yeah, the opposite, I got confused.

Let me try reverting that.
Also curious about the influence of the LBR here.
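
To spell out the two orderings being compared, here is a user-space model of
the handler flow. It is only a sketch based on the snippet quoted in this
thread: the step names are made up, and the real handler does more between
disable and enable than this shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step labels for the model, not kernel identifiers. */
enum step {
    STEP_ACK_LVTPC,    /* apic_write(APIC_LVTPC, APIC_DM_NMI) */
    STEP_DISABLE_PMU,
    STEP_DRAIN_PEBS,
    STEP_ENABLE_PMU
};

/* Records the order of handler steps for a given late_ack setting.
 * Returns the number of steps written to out (always 4). */
static size_t pmi_handler_model(bool late_ack, enum step *out, size_t cap)
{
    size_t n = 0;

    assert(cap >= 4);
    if (!late_ack)
        out[n++] = STEP_ACK_LVTPC;  /* early ack: on entry */
    out[n++] = STEP_DISABLE_PMU;
    out[n++] = STEP_DRAIN_PEBS;
    out[n++] = STEP_ENABLE_PMU;
    if (late_ack)
        out[n++] = STEP_ACK_LVTPC;  /* late ack: after processing */
    return n;
}
```

With late_ack = true (the HSW setting) the ack is the last step. Reverting it
moves the ack to the first step, before the PMU is disabled.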

> Do the symptoms change if you remove the x86_pmu.late_ack setting line
> from:
>
>         case 60: /* Haswell Client */
>         case 70:
>         case 71:
>         case 63:
>         case 69:
>                 x86_pmu.late_ack = true;
>
> ?
>
> Thanks,
>
>         Ingo