linux-kernel - Re: [PATCH V2 0/3] Support auto counter reload

Message-ID: <7fa16a12-2b31-4bab-893e-ba4fe017339b@linux.intel.com>
Date: Mon, 4 Nov 2024 15:37:08 -0500
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: peterz@...radead.org, mingo@...nel.org, acme@...nel.org,
 namhyung@...nel.org, irogers@...gle.com, adrian.hunter@...el.com,
 ak@...ux.intel.com, linux-kernel@...r.kernel.org
Cc: eranian@...gle.com, thomas.falcon@...el.com
Subject: Re: [PATCH V2 0/3] Support auto counter reload

Hi Peter,

Ping. Could you please let me know if you have any comments?

Thanks,
Kan

On 2024-10-10 3:28 p.m., kan.liang@...ux.intel.com wrote:
> From: Kan Liang <kan.liang@...ux.intel.com>
> 
> Changes since V1:
> - Add a check that the reload value cannot exceed the max period
> - Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
> - Update the comments to explain the case in which
>   event->attr.config2 exceeds the group size
> 
> The relative rates among two or more events are useful for performance
> analysis, e.g., a high branch miss rate may indicate a performance
> issue. Usually, the samples with a relative rate that exceeds some
> threshold are the most useful. However, traditional sampling samples
> each event separately. To get the relative rates among two or more
> events, a high sample rate is required, which can bring high overhead.
> Many samples taken outside the hotspots are also discarded as useless
> during post-processing.
> 
> Auto Counter Reload (ACR) provides a means for software to specify that,
> for each supported counter, the hardware should automatically reload the
> counter to a specified initial value upon overflow of chosen counters.
> This mechanism enables software to sample based on the relative rate of
> two (or more) events, such that a sample (PMI or PEBS) is taken only if
> the rate of one event exceeds some threshold relative to the rate of
> another event. Taking a PMI or PEBS only when the relative rate of
> perfmon events crosses a threshold can have significantly less
> performance overhead than other techniques.
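
To make the reload rule above concrete, here is a minimal user-space sketch of a two-counter ACR group. It is a toy model, not kernel or hardware code: the struct, the function names, and the rule that bit i of a counter's mask means "reload when counter i overflows" are assumptions made for illustration, as is the choice that only the first counter records samples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one counter in an ACR group (hypothetical, not kernel code). */
struct acr_counter {
        uint64_t period;    /* events between a reload and the next overflow */
        uint64_t remaining; /* events left until the next overflow */
        uint32_t acr_mask;  /* assumption: bit i set => reload when counter i overflows */
        bool sampling;      /* true if an overflow of this counter records a sample */
        uint64_t samples;   /* samples recorded so far */
};

/* Count one event on counter idx and apply the reload rules on overflow. */
static void acr_count(struct acr_counter *ctr, size_t n, size_t idx)
{
        assert(idx < n && n <= 32);
        assert(ctr[idx].remaining > 0);

        if (--ctr[idx].remaining != 0)
                return;

        /* Counter idx overflowed. */
        if (ctr[idx].sampling)
                ctr[idx].samples++;

        for (size_t i = 0; i < n; i++) {
                /* Simplification: a counter always restarts after its own overflow. */
                if (i == idx || (ctr[i].acr_mask & (UINT32_C(1) << idx)))
                        ctr[i].remaining = ctr[i].period;
        }
}

/*
 * Feed `branches` branches with an evenly spread miss rate of
 * miss_permille / 1000 into a two-counter group and return the number
 * of samples recorded on the miss counter.
 */
static uint64_t acr_simulate(uint64_t branches, uint64_t miss_permille,
                             uint64_t miss_period, uint64_t branch_period)
{
        struct acr_counter ctr[2] = {
                /* counter 0: misses, sampling, reloaded when counter 1 overflows */
                { miss_period, miss_period, 0x2, true, 0 },
                /* counter 1: branches, not sampling, reloaded when 0 or 1 overflows */
                { branch_period, branch_period, 0x3, false, 0 },
        };

        assert(miss_permille <= 1000 && miss_period > 0 && branch_period > 0);

        for (uint64_t i = 0; i < branches; i++) {
                acr_count(ctr, 2, 1); /* every branch is counted */
                /* This branch is a miss if the running miss total steps up here. */
                if ((i + 1) * miss_permille / 1000 > i * miss_permille / 1000)
                        acr_count(ctr, 2, 0);
        }
        return ctr[0].samples;
}
```

With a miss period of 200 and a branch period of 1000 (a 20% threshold), calling acr_simulate() with a 21% miss stream records samples, because the miss counter overflows before the branch counter can reload it. An 11% stream records none, because the branch counter always overflows first and resets both counters.
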
> 
> Details can be found in the Intel Architecture Instruction Set
> Extensions and Future Features document (053), Section 8.7,
> AUTO COUNTER RELOAD.
> 
> Examples:
> 
> Here is a snippet of mispredict.c. Since the array holds random
> numbers, the branches are data-dependent and often mispredicted.
> The misprediction rate depends on the value compared against.
> 
> For Loop 1, ~11% of all branches are mispredicted.
> For Loop 2, ~21% of all branches are mispredicted.
> 
> main()
> {
> ...
>         for (i = 0; i < N; i++)
>                 data[i] = rand() % 256;
> ...
>         /* Loop 1 */
>         for (k = 0; k < 50; k++)
>                 for (i = 0; i < N; i++)
>                         if (data[i] >= 64)
>                                 sum += data[i];
> ...
> 
> ...
>         /* Loop 2 */
>         for (k = 0; k < 50; k++)
>                 for (i = 0; i < N; i++)
>                         if (data[i] >= 128)
>                                 sum += data[i];
> ...
> }
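
For readers who want to experiment with the loops above, here is a small self-contained sketch in the same spirit. It is not the author's mispredict.c: the elided parts of that file are unknown, so the helper names, the seed parameter, and the array size chosen by the caller are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill data[0..n) with pseudo-random values in [0, 256), as in the snippet. */
static void fill_random(int *data, size_t n, unsigned int seed)
{
        assert(data != NULL);
        srand(seed); /* hypothetical: the original seeding is not shown */
        for (size_t i = 0; i < n; i++)
                data[i] = rand() % 256;
}

/* One pass over the array: add up the elements that are >= threshold. */
static long sum_at_least(const int *data, size_t n, int threshold)
{
        long sum = 0;

        assert(data != NULL);
        for (size_t i = 0; i < n; i++)
                if (data[i] >= threshold) /* data-dependent branch */
                        sum += data[i];
        return sum;
}

/* Repeat the pass, like the outer k loop in the snippet. */
static long run_loop(const int *data, size_t n, int threshold, int repeats)
{
        long sum = 0;

        for (int k = 0; k < repeats; k++)
                sum += sum_at_least(data, n, threshold);
        return sum;
}
```

Calling run_loop(data, n, 64, 50) and run_loop(data, n, 128, 50) from a main() corresponds to Loop 1 and Loop 2. Note that an optimizing compiler may replace the conditional branch with a conditional move or vectorized code, which would change the branch behavior being measured.
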
> 
> Usually, code with a high branch miss rate performs poorly.
> To understand the branch miss rate of the code, the traditional method
> samples both the branches and the branch-misses events. E.g.,
> perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
>                -c 1000000 -- ./mispredict
> 
> [ perf record: Woken up 4 times to write data ]
> [ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
> The 5106 samples come from both events and are spread across both
> loops. In the post-processing stage, a user can learn that Loop 2 has
> a 21% branch miss rate. They can then focus on the branch-misses
> samples for Loop 2.
> 
> With this patch, the user can generate samples only when the branch
> miss rate exceeds 20%.
> perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
>                  cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
>                 -- ./mispredict
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]
> 
>  $ perf report
> 
> Percent       │154:   movl    $0x0,-0x14(%rbp)
>               │     ↓ jmp     1af
>               │     for (i = j; i < N; i++)
>               │15d:   mov     -0x10(%rbp),%eax
>               │       mov     %eax,-0x18(%rbp)
>               │     ↓ jmp     1a2
>               │     if (data[i] >= 128)
>               │165:   mov     -0x18(%rbp),%eax
>               │       cltq
>               │       lea     0x0(,%rax,4),%rdx
>               │       mov     -0x8(%rbp),%rax
>               │       add     %rdx,%rax
>               │       mov     (%rax),%eax
>               │    ┌──cmp     $0x7f,%eax
> 100.00   0.00 │    ├──jle     19e
>               │    │sum += data[i];
> 
> The 2498 samples are all from the branch-misses event in Loop 2.
> 
> Both the number of samples and the overhead are significantly reduced
> without losing any information.
> 
> Kan Liang (3):
>   perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
>   perf/x86/intel: Add the enumeration and flag for the auto counter
>     reload
>   perf/x86/intel: Support auto counter reload
> 
>  arch/x86/events/intel/core.c       | 262 ++++++++++++++++++++++++++++-
>  arch/x86/events/perf_event.h       |  21 +++
>  arch/x86/events/perf_event_flags.h |   2 +-
>  arch/x86/include/asm/msr-index.h   |   4 +
>  arch/x86/include/asm/perf_event.h  |   4 +-
>  include/linux/perf_event.h         |   2 +
>  6 files changed, 288 insertions(+), 7 deletions(-)
>