linux-kernel - Re: [PATCH] perf/x86/intel: Restrict period on Haswell

Message-ID: <70657c5e-f771-456b-a5ac-3df590249288@linux.intel.com>
Date: Wed, 14 Aug 2024 15:37:59 -0400
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: Thomas Gleixner <tglx@...utronix.de>, Li Huafei <lihuafei1@...wei.com>,
 peterz@...radead.org, mingo@...hat.com
Cc: acme@...nel.org, namhyung@...nel.org, mark.rutland@....com,
 alexander.shishkin@...ux.intel.com, jolsa@...nel.org, irogers@...gle.com,
 adrian.hunter@...el.com, bp@...en8.de, dave.hansen@...ux.intel.com,
 x86@...nel.org, hpa@...or.com, linux-perf-users@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] perf/x86/intel: Restrict period on Haswell



On 2024-08-14 3:01 p.m., Thomas Gleixner wrote:
> On Wed, Aug 14 2024 at 14:15, Kan Liang wrote:
>> On 2024-08-14 10:52 a.m., Thomas Gleixner wrote:
>>> Now looking at the HSW specification update specifically erratum HSW11:
>>>
>>>   Performance Monitor Precise Instruction Retired Event May Present
>>>   Wrong Indications
>>>
>>>   Problem:
>>>          When the Precise Distribution for Instructions Retired (PDIR)
>>>          mechanism is activated (INST_RETIRED.ALL (event C0H, umask
>>>          value 00H) on Counter 1 programmed in PEBS mode), the processor
>>>          may return wrong PEBS or Performance Monitoring Interrupt (PMI)
>>>          interrupts and/or incorrect counter values if the counter is
>>>          reset with a Sample-After-Value (SAV) below 100 (the SAV is
>>>          the counter reset value software programs in the MSR
>>>          IA32_PMC1[47:0] in order to control interrupt frequency).
>>>
>>>   Implication:
>>>          Due to this erratum, when using low SAV values, the program may
>>>          get incorrect PEBS or PMI interrupts and/or an invalid counter
>>>          state.
>>>
>>>   Workaround:
>>>          The sampling driver should avoid using SAV<100.
>>>
>>> IOW, that's exactly the same issue as the BDM11 erratum.
>>>
>>> Kan: Can you please go through the various specification updates and
>>> identify which generations are affected by this and fix it once and
>>> forever in a sane way instead of relying on 'tried until it works by
>>> some definition of works' hacks. These errata are there for a reason.
>>
>> Sure. I will check all the related errata and propose a fix.
>>
>>> But that does not explain the fallout with that cve test because that
>>> does not use PEBS. It's using fixed counter 0.
>>
>> The erratum also mentions PMI interrupts, which may imply that the
>> non-PEBS case is affected too. I will double check with the architect.
> 
> Ah. Indeed.
> 
>> According to the description of the patch, if I understand correctly, it
>> runs 100 CVE-2015-3290 tests at the same time. If so, all the GP
>> counters are used. Huafei, could you please confirm?
> 
> I can reproduce that way on my quad socket HSW almost instantaneously:
> 
> [10473.376928] CPU#16: ctrl:       0000000000000000
> [10473.376930] CPU#16: status:     0000000000000000
> [10473.376931] CPU#16: overflow:   0000000000000000
> [10473.376932] CPU#16: fixed:      00000000000000bb
> [10473.376933] CPU#16: pebs:       0000000000000000
> [10473.376934] CPU#16: debugctl:   0000000000004000
> [10473.376935] CPU#16: active:     0000000300000000
> [10473.376937] CPU#16:   gen-PMC0 ctrl:  0000000000134f2e
> [10473.376938] CPU#16:   gen-PMC0 count: 0000ffffffffffca
> [10473.376940] CPU#16:   gen-PMC0 left:  000000000000003b
> [10473.376941] CPU#16:   gen-PMC1 ctrl:  0000000000000000
> [10473.376943] CPU#16:   gen-PMC1 count: 0000000000000000
> [10473.376944] CPU#16:   gen-PMC1 left:  0000000000000000
> [10473.376946] CPU#16:   gen-PMC2 ctrl:  0000000000000000
> [10473.376947] CPU#16:   gen-PMC2 count: 0000000000000000
> [10473.376948] CPU#16:   gen-PMC2 left:  0000000000000000
> [10473.376949] CPU#16:   gen-PMC3 ctrl:  0000000000000000
> [10473.376950] CPU#16:   gen-PMC3 count: 0000000000000000
> [10473.376952] CPU#16:   gen-PMC3 left:  0000000000000000
> [10473.376953] CPU#16: fixed-PMC0 count: 0000fffffffffffe
> [10473.376954] CPU#16: fixed-PMC1 count: 0000fffbabf57908
> [10473.376955] CPU#16: fixed-PMC2 count: 0000000000000000
> 
> [10473.376928] CPU#88: ctrl:       0000000000000000
> [10473.376930] CPU#88: status:     0000000000000000
> [10473.376931] CPU#88: overflow:   0000000000000000
> [10473.376932] CPU#88: fixed:      00000000000000bb
> [10473.376933] CPU#88: pebs:       0000000000000000
> [10473.376934] CPU#88: debugctl:   0000000000004000
> [10473.376935] CPU#88: active:     0000000300000000
> [10473.376937] CPU#88:   gen-PMC0 ctrl:  0000000000134f2e
> [10473.376939] CPU#88:   gen-PMC0 count: 0000fffffffffff2
> [10473.376940] CPU#88:   gen-PMC0 left:  00000000000000a8
> [10473.376942] CPU#88:   gen-PMC1 ctrl:  0000000000000000
> [10473.376944] CPU#88:   gen-PMC1 count: 0000000000000000
> [10473.376945] CPU#88:   gen-PMC1 left:  0000000000000000
> [10473.376946] CPU#88:   gen-PMC2 ctrl:  0000000000000000
> [10473.376947] CPU#88:   gen-PMC2 count: 0000000000000000
> [10473.376949] CPU#88:   gen-PMC2 left:  0000000000000000
> [10473.376950] CPU#88:   gen-PMC3 ctrl:  0000000000000000
> [10473.376951] CPU#88:   gen-PMC3 count: 0000000000000000
> [10473.376952] CPU#88:   gen-PMC3 left:  0000000000000000
> [10473.376953] CPU#88: fixed-PMC0 count: 0000fffffffffffe
> [10473.376955] CPU#88: fixed-PMC1 count: 0000fffa79a83958
> [10473.376956] CPU#88: fixed-PMC2 count: 0000000000000000
> 
> This happens at the very same time, and CPU#88 is the HT sibling of
> CPU#16.
> 

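As an aside on reading the dump above: each "count" value can be turned
into the number of events left before the counter overflows by
subtracting it from 2^48. This is only a minimal sketch. The 48-bit
width is an assumption inferred from the 0000ffff... values in the dump,
and events_until_overflow() is a hypothetical helper, not a kernel
function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: counters are 48 bits wide, as the 0000ffff........ values
 * in the dump suggest. */
#define PMC_WIDTH 48

/* Hypothetical helper: number of events remaining until a counter of the
 * given bit width wraps around to zero. */
static uint64_t events_until_overflow(uint64_t count, unsigned int width)
{
	assert(width > 0 && width < 64);

	uint64_t mask = (UINT64_C(1) << width) - 1;

	/* Distance from the current value to 2^width. */
	return (mask - (count & mask)) + 1;
}
```

With the values quoted above, gen-PMC0 is 54 events from overflow on
CPU#16 and 14 on CPU#88, and fixed-PMC0 is 2 events from overflow on
both CPUs.
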
Fixed counter 0 is in use here, which does not match what HSW11
describes. I will check whether HSW11 missed this case or whether there
is a separate issue.
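
For reference, the workaround text in HSW11 ("avoid using SAV<100")
amounts to clamping the sample period before the counter is programmed.
The following is only a sketch of that idea, not the actual fix.
SAV_MIN and clamp_sav() are hypothetical names, and whether the limit
should apply to every counter or only to specific events is exactly the
open question in this thread.

```c
#include <stdint.h>

/* Assumption: 100 is taken directly from the HSW11 workaround text. */
#define SAV_MIN 100

/* Hypothetical helper, not the kernel's limit_period callback: raise a
 * too-small remaining period to the minimum the erratum allows. */
static int64_t clamp_sav(int64_t left)
{
	return left < SAV_MIN ? SAV_MIN : left;
}
```

With this sketch, a remaining period of 0x3b (59) would be raised to
100, while 0xa8 (168) would be left unchanged.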

Thanks,
Kan