Message-ID: <41986677e1d2d20da80d3e7366034744b18a1e56.camel@intel.com>
Date: Wed, 26 Feb 2025 21:43:04 +0000
From: "Falcon, Thomas" <thomas.falcon@...el.com>
To: "alexander.shishkin@...ux.intel.com" <alexander.shishkin@...ux.intel.com>,
"peterz@...radead.org" <peterz@...radead.org>, "acme@...nel.org"
<acme@...nel.org>, "mingo@...hat.com" <mingo@...hat.com>,
"kan.liang@...ux.intel.com" <kan.liang@...ux.intel.com>, "Hunter, Adrian"
<adrian.hunter@...el.com>, "namhyung@...nel.org" <namhyung@...nel.org>,
"irogers@...gle.com" <irogers@...gle.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
CC: "dapeng1.mi@...ux.intel.com" <dapeng1.mi@...ux.intel.com>,
"ak@...ux.intel.com" <ak@...ux.intel.com>, "Eranian, Stephane"
<eranian@...gle.com>
Subject: Re: [PATCH V3 0/5] Support auto counter reload
On Thu, 2025-02-13 at 13:17 -0800, kan.liang@...ux.intel.com wrote:
> From: Kan Liang <kan.liang@...ux.intel.com>
>
> Changes since V2:
> - Rebase on top of several new features, e.g., counters snapshotting
>   feature. Rewrite the code for the ACR CPUID-enumeration,
>   configuration and late setup.
> - Patch 1-3 are newly added for clean up.
>
> Changes since V1:
> - Add a check that the reload value cannot exceed the max period
> - Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
> - Update the comments to explain the case in which
>   event->attr.config2 exceeds the group size
>
> The relative rates among two or more events are useful for performance
> analysis, e.g., a high branch miss rate may indicate a performance
> issue. Usually, the samples with a relative rate that exceeds some
> issue. Usually, the samples with a relative rate that exceeds some
> threshold are more useful. However, the traditional sampling takes
> samples of events separately. To get the relative rates among two or
> more events, a high sample rate is required, which can bring high
> overhead. Many samples taken in the non-hotspot area are also dropped
> (useless) in the post-process.
>
> The auto counter reload (ACR) feature takes samples when the relative
> rate of two or more events exceeds some threshold, which provides the
> fine-grained information at a low cost.
> To support the feature, two sets of MSRs are introduced. For a given
> counter IA32_PMC_GPn_CTR/IA32_PMC_FXm_CTR, bit fields in the
> IA32_PMC_GPn_CFG_B/IA32_PMC_FXm_CFG_B MSR indicate which counter(s)
> can cause a reload of that counter. The reload value is stored in the
> IA32_PMC_GPn_CFG_C/IA32_PMC_FXm_CFG_C.
> The details can be found at Intel SDM (085), Volume 3, 21.9.11 Auto
> Counter Reload.
Works for me on a Core Ultra 9 275HX.
Tested-by: Thomas Falcon <thomas.falcon@...el.com>
Tom
>
> Example:
>
> Here is a snippet of mispredict.c. Since the array holds random
> numbers, the jumps are random and often mispredicted.
> The misprediction rate depends on the compared value.
>
> For the Loop1, ~11% of all branches are mispredicted.
> For the Loop2, ~21% of all branches are mispredicted.
>
> main()
> {
>     ...
>     for (i = 0; i < N; i++)
>         data[i] = rand() % 256;
>     ...
>     /* Loop 1 */
>     for (k = 0; k < 50; k++)
>         for (i = 0; i < N; i++)
>             if (data[i] >= 64)
>                 sum += data[i];
>     ...
>
>     ...
>     /* Loop 2 */
>     for (k = 0; k < 50; k++)
>         for (i = 0; i < N; i++)
>             if (data[i] >= 128)
>                 sum += data[i];
>     ...
> }
>
> Usually, code with a high branch miss rate performs poorly.
> To understand the branch miss rate of the code, the traditional
> method usually samples both the branches and branch-misses events.
> E.g.,
> perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
>             -c 1000000 -- ./mispredict
>
> [ perf record: Woken up 4 times to write data ]
> [ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
> The 5106 samples are from both events and spread in both Loops.
> In the post-process stage, a user can know that the Loop 2 has a 21%
> branch miss rate. Then they can focus on the samples of branch-misses
> events for the Loop 2.
>
> With this patch, the user can generate the samples only when the
> branch miss rate > 20%. For example,
> perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
>                 cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
>             -- ./mispredict
>
> (Two different periods are applied to branch-misses and
> branch-instructions, so the ratio is set to 20%.
> If branch-instructions overflows first, the branch-miss rate is
> < 20%. No samples should be generated, and all counters should be
> automatically reloaded.
> If branch-misses overflows first, the branch-miss rate is > 20%.
> A sample triggered by the branch-misses event should be generated,
> and only the branch-instructions counter should be automatically
> reloaded.
>
> The branch-misses event should only be automatically reloaded when
> branch-instructions overflows, so its "cause" event is the
> branch-instructions event. Its acr_mask is set to 0x2, since the
> event index of branch-instructions within the group is 1.
>
> The branch-instructions event is automatically reloaded no matter
> which event overflows, so its "cause" events are both the
> branch-misses and the branch-instructions events. Its acr_mask
> should be set to 0x3.)
>
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]
>
> $perf report
>
> Percent │154: movl $0x0,-0x14(%rbp)
> │ ↓ jmp 1af
> │ for (i = j; i < N; i++)
> │15d: mov -0x10(%rbp),%eax
> │ mov %eax,-0x18(%rbp)
> │ ↓ jmp 1a2
> │ if (data[i] >= 128)
> │165: mov -0x18(%rbp),%eax
> │ cltq
> │ lea 0x0(,%rax,4),%rdx
> │ mov -0x8(%rbp),%rax
> │ add %rdx,%rax
> │ mov (%rax),%eax
> │ ┌──cmp $0x7f,%eax
> 100.00 0.00 │ ├──jle 19e
> │ │sum += data[i];
>
> The 2498 samples are all from the branch-misses events for the Loop 2.
>
> The number of samples and the overhead are significantly reduced
> without losing any information.
>
> Kan Liang (5):
> perf/x86: Add dynamic constraint
> perf/x86/intel: Track the num of events needs late setup
> perf: Extend the bit width of the arch-specific flag
> perf/x86/intel: Add CPUID enumeration for the auto counter reload
> perf/x86/intel: Support auto counter reload
>
>  arch/x86/events/core.c             |   3 +-
>  arch/x86/events/intel/core.c       | 260 ++++++++++++++++++++++++++++-
>  arch/x86/events/intel/ds.c         |   3 +-
>  arch/x86/events/intel/lbr.c        |   2 +-
>  arch/x86/events/perf_event.h       |  33 ++++
>  arch/x86/events/perf_event_flags.h |  41 ++---
>  arch/x86/include/asm/msr-index.h   |   4 +
>  arch/x86/include/asm/perf_event.h  |   1 +
>  include/linux/perf_event.h         |   4 +-
>  9 files changed, 320 insertions(+), 31 deletions(-)
>