[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20250213211718.2406744-1-kan.liang@linux.intel.com>
Date: Thu, 13 Feb 2025 13:17:13 -0800
From: kan.liang@...ux.intel.com
To: peterz@...radead.org,
mingo@...hat.com,
acme@...nel.org,
namhyung@...nel.org,
irogers@...gle.com,
adrian.hunter@...el.com,
alexander.shishkin@...ux.intel.com,
linux-kernel@...r.kernel.org
Cc: ak@...ux.intel.com,
eranian@...gle.com,
dapeng1.mi@...ux.intel.com,
thomas.falcon@...el.com,
Kan Liang <kan.liang@...ux.intel.com>
Subject: [PATCH V3 0/5] Support auto counter reload
From: Kan Liang <kan.liang@...ux.intel.com>
Changes since V2:
- Rebase on top of several new features, e.g., counters snapshotting
feature. Rewrite the code for the ACR CPUID-enumeration, configuration
and late setup.
- Patch 1-3 are newly added for clean up.
Changes since V1:
- Add a check to the reload value which cannot exceeds the max period
- Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
- Update comments explain to case which the event->attr.config2 exceeds
the group size
The relative rates among two or more events are useful for performance
analysis, e.g., a high branch miss rate may indicate a performance
issue. Usually, the samples with a relative rate that exceeds some
threshold are more useful. However, the traditional sampling takes
samples of events separately. To get the relative rates among two or
more events, a high sample rate is required, which can bring high
overhead. Many samples taken in the non-hotspot area are also dropped
(useless) in the post-process.
The auto counter reload (ACR) feature takes samples when the relative
rate of two or more events exceeds some threshold, which provides the
fine-grained information at a low cost.
To support the feature, two sets of MSRs are introduced. For a given
counter IA32_PMC_GPn_CTR/IA32_PMC_FXm_CTR, bit fields in the
IA32_PMC_GPn_CFG_B/IA32_PMC_FXm_CFG_B MSR indicate which counter(s)
can cause a reload of that counter. The reload value is stored in the
IA32_PMC_GPn_CFG_C/IA32_PMC_FXm_CFG_C.
The details can be found at Intel SDM (085), Volume 3, 21.9.11 Auto
Counter Reload.
Example:
Here is the snippet of the mispredict.c. Since the array has a random
numbers, jumps are random and often mispredicted.
The mispredicted rate depends on the compared value.
For the Loop1, ~11% of all branches are mispredicted.
For the Loop2, ~21% of all branches are mispredicted.
main()
{
...
for (i = 0; i < N; i++)
data[i] = rand() % 256;
...
/* Loop 1 */
for (k = 0; k < 50; k++)
for (i = 0; i < N; i++)
if (data[i] >= 64)
sum += data[i];
...
...
/* Loop 2 */
for (k = 0; k < 50; k++)
for (i = 0; i < N; i++)
if (data[i] >= 128)
sum += data[i];
...
}
Usually, a code with a high branch miss rate means a bad performance.
To understand the branch miss rate of the codes, the traditional method
usually samples both branches and branch-misses events. E.g.,
perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
-c 1000000 -- ./mispredict
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
The 5106 samples are from both events and spread in both Loops.
In the post-process stage, a user can know that the Loop 2 has a 21%
branch miss rate. Then they can focus on the samples of branch-misses
events for the Loop 2.
With this patch, the user can generate the samples only when the branch
miss rate > 20%. For example,
perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
-- ./mispredict
(Two different periods are applied to branch-misses and
branch-instructions. The ratio is set to 20%.
If the branch-instructions is overflowed first, the branch-miss
rate < 20%. No samples should be generated. All counters should be
automatically reloaded.
If the branch-misses is overflowed first, the branch-miss rate > 20%.
A sample triggered by the branch-misses event should be
generated. Just the counter of the branch-instructions should be
automatically reloaded.
The branch-misses event should only be automatically reloaded when
the branch-instructions is overflowed. So the "cause" event is the
branch-instructions event. The acr_mask is set to 0x2, since the
event index of branch-instructions is 1.
The branch-instructions event is automatically reloaded no matter which
events are overflowed. So the "cause" events are the branch-misses
and the branch-instructions event. The acr_mask should be set to 0x3.)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]
$perf report
Percent │154: movl $0x0,-0x14(%rbp)
│ ↓ jmp 1af
│ for (i = j; i < N; i++)
│15d: mov -0x10(%rbp),%eax
│ mov %eax,-0x18(%rbp)
│ ↓ jmp 1a2
│ if (data[i] >= 128)
│165: mov -0x18(%rbp),%eax
│ cltq
│ lea 0x0(,%rax,4),%rdx
│ mov -0x8(%rbp),%rax
│ add %rdx,%rax
│ mov (%rax),%eax
│ ┌──cmp $0x7f,%eax
100.00 0.00 │ ├──jle 19e
│ │sum += data[i];
The 2498 samples are all from the branch-misses events for the Loop 2.
The number of samples and overhead is significantly reduced without
losing any information.
Kan Liang (5):
perf/x86: Add dynamic constraint
perf/x86/intel: Track the num of events needs late setup
perf: Extend the bit width of the arch-specific flag
perf/x86/intel: Add CPUID enumeration for the auto counter reload
perf/x86/intel: Support auto counter reload
arch/x86/events/core.c | 3 +-
arch/x86/events/intel/core.c | 260 ++++++++++++++++++++++++++++-
arch/x86/events/intel/ds.c | 3 +-
arch/x86/events/intel/lbr.c | 2 +-
arch/x86/events/perf_event.h | 33 ++++
arch/x86/events/perf_event_flags.h | 41 ++---
arch/x86/include/asm/msr-index.h | 4 +
arch/x86/include/asm/perf_event.h | 1 +
include/linux/perf_event.h | 4 +-
9 files changed, 320 insertions(+), 31 deletions(-)
--
2.38.1
Powered by blists - more mailing lists