Message-ID: <1fa53163-6651-4053-ad80-837d6bf92e6f@linaro.org>
Date: Thu, 14 Nov 2024 14:36:28 +0000
From: James Clark <james.clark@...aro.org>
To: Ian Rogers <irogers@...gle.com>, Deepak Surti <deepak.surti@....com>,
 Leo Yan <leo.yan@....com>
Cc: peterz@...radead.org, mingo@...hat.com, acme@...nel.org,
 namhyung@...nel.org, mark.barnett@....com, ben.gainey@....com,
 ak@...ux.intel.com, will@...nel.org, james.clark@....com,
 mark.rutland@....com, alexander.shishkin@...ux.intel.com, jolsa@...nel.org,
 adrian.hunter@...el.com, linux-perf-users@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH v1 0/4] A mechanism for efficient support for per-function
 metrics



On 14/11/2024 2:22 am, Ian Rogers wrote:
> On Thu, Nov 7, 2024 at 8:08 AM Deepak Surti <deepak.surti@....com> wrote:
>>
>> This patch introduces the concept of an alternating sample period to perf
>> core and provides the necessary basic changes in the tools to activate
>> that option.
>>
>> This patchset was originally posted by Ben Gainey as an RFC back in
>> April; the latest version can be found at
>> https://lore.kernel.org/linux-perf-users/20240422104929.264241-1-ben.gainey@arm.com/.
>> Going forward, I will be owning this.
>>
>> The primary use case for this change is to be able to enable collecting
>> per-function performance metrics using the Arm PMU, as per the following
>> approach:
>>
>>   * Starting with a simple periodic sampling (hotspot) profile,
>>     augment each sample with PMU counters accumulated over a short window
>>     up to the point the sample was taken.
>>   * For each sample, perform some filtering to improve attribution of
>>     the accumulated PMU counters (ensure they are attributed to a single
>>     function)
>>   * For each function accumulate a total for each PMU counter so that
>>     metrics may be derived.
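>>
>> As a rough sketch, the filter-and-accumulate step amounts to the
>> following (in C for illustration; the real post-processing is a
>> perf-script Python script, and all names here are invented):
>>
>>     #include <stdint.h>
>>     #include <string.h>
>>
>>     #define NR_EVENTS 5               /* counters in the event group */
>>
>>     struct func_totals {
>>         const char *sym;              /* function name               */
>>         uint64_t counts[NR_EVENTS];   /* accumulated PMU counters    */
>>     };
>>
>>     /* Attribute a window's counter deltas to a function only if the
>>      * same function was live at both ends of the window. */
>>     static void accumulate(struct func_totals *t, const char *sym_start,
>>                            const char *sym_end, const uint64_t *deltas)
>>     {
>>         if (strcmp(sym_start, sym_end) != 0)
>>             return;                   /* spans >1 function: discard  */
>>         for (int i = 0; i < NR_EVENTS; i++)
>>             t->counts[i] += deltas[i];
>>     }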
>>
>> Without modification, sampling at a typical rate associated with
>> hotspot profiling (~1ms) leads to poor results. Such an approach
>> gives a reasonable estimate of where the profiled application is
>> spending time, at relatively low overhead, but the PMU counters
>> cannot easily be attributed to a single function as the window over
>> which they are collected is too large. A modern CPU may execute many
>> millions of instructions over many thousands of functions within a
>> 1ms window. With this approach, the per-function metrics tend to
>> trend towards some average value across the top N functions in the
>> profile.
>>
>> In order to ensure a reasonable likelihood that the counters are
>> attributed to a single function, the sampling window must be rather
>> short; typically something on the order of a few hundred cycles works
>> well, as tested on a range of aarch64 Cortex and Neoverse cores.
>>
>> As it stands, it is possible to achieve this with perf using a very high
>> sampling rate (e.g. ~300 cycles), but there are at least three major
>> concerns with this approach:
>>
>>   * For speculatively executing, out-of-order cores, can the results
>>     be accurately attributed to a given function or the given sample
>>     window?
>>   * A short sample window is not guaranteed to cover a single function.
>>   * The overhead of sampling every few hundred cycles is very high and
>>     is highly likely to cause throttling which is undesirable as it leads
>>     to patchy results; i.e. the profile alternates between periods of
>>     high frequency samples followed by longer periods of no samples.
>>
>> This patch does not address the first two points directly. Some means
>> to address those are discussed in the RFC v2 cover letter. The patch
>> focuses on addressing the final point, though happily this approach
>> gives us a way to perform basic filtering on the second point.
>>
>> The alternating sample period allows us to do two things:
>>
>>   * We can control the risk of throttling and reduce overhead by
>>     alternating between a long and short period. This allows us to
>>     decouple the "periodic" sampling rate (as might be used for hotspot
>>     profiling) from the short sampling window needed for collecting
>>     the PMU counters.
>>   * The sample taken at the end of the long period can be otherwise
>>     discarded (as the PMU data is not useful), but the
>>     PERF_RECORD_CALLCHAIN information can be used to identify the current
>>     function at the start of the short sample window. This is useful
>>     for filtering samples where the PMU counter data cannot be attributed
>>     to a single function.
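>>
>> A minimal sketch of the switching rule this implies (illustrative
>> only, not lifted from the actual patch):
>>
>>     #include <stdint.h>
>>
>>     /* On each counter overflow, flip between the long "gap" period
>>      * and the short counting window. */
>>     static uint64_t next_sample_period(uint64_t current_period,
>>                                        uint64_t long_period,
>>                                        uint64_t short_period)
>>     {
>>         /* The long gap just elapsed: open the short window. */
>>         if (current_period == long_period)
>>             return short_period;
>>         /* The short window just closed: start the next gap. */
>>         return long_period;
>>     }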
> 
> I think this is interesting. I'm a little concerned about the approach,
> as I wonder if a more flexible mechanism could be had.
> 
> One approach that wouldn't work would be to open high and low
> frequency events, or groups of events, then use BPF filters to try to
> replicate this approach by dropping most of the high frequency events.
> I don't think it would work as the high frequency sampling is likely
> going to trigger during the BPF filter execution, and the BPF filter
> would be too much overhead.
> 
> Perhaps another approach is to change the perf event period with a new
> BPF helper function that's called where we do the perf event
> filtering. There's the overhead of running the BPF code, but the BPF
> code could let you alternate between an arbitrary number of periods
> rather than just two.
> 
> Thanks,
> Ian
> 

There might be something to the arbitrary number of periods, because for 
a very short run you might want a high sample rate and for a long run 
you would want a low rate. BPF might allow you to reduce the rate over 
time so you don't have to worry about picking the right one so much.
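
For illustration, a sketch of what that could look like on the BPF side,
assuming a period-setting helper roughly as Ian describes (the helper
below is hypothetical, nothing like it exists today):

    #include <linux/bpf.h>
    #include <linux/bpf_perf_event.h>
    #include <bpf/bpf_helpers.h>

    /* HYPOTHETICAL: this helper does not exist and would need adding. */
    extern int bpf_perf_event_set_period(struct bpf_perf_event_data *ctx,
                                         __u64 period) __ksym;

    __u64 nr_samples;

    SEC("perf_event")
    int adjust_period(struct bpf_perf_event_data *ctx)
    {
            /* Back off over time: double the period every 10000 samples
             * rather than committing to a single rate up front. */
            if ((++nr_samples % 10000) == 0)
                    bpf_perf_event_set_period(ctx, ctx->sample_period * 2);

            return 1;   /* non-zero: deliver the sample as normal */
    }

    char LICENSE[] SEC("license") = "GPL";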

+Leo because I think he's looking at linking BPF to the aux pause/resume 
patches [1] which could be similar.

[1]: 
https://lore.kernel.org/linux-perf-users/20241114101711.34987-1-adrian.hunter@intel.com/T/#t

>> There are several reasons why it is desirable to reduce the overhead and
>> risk of throttling:
>>
>>    * PMU counter overflow typically causes an interrupt into the kernel;
>>      this affects program runtime, and can affect things like branch
>>      prediction, cache locality and so on which can skew the metrics.
>>    * The very high sample rate produces significant amounts of data.
>>      Depending on the configuration of the profiling session and machine,
>>      it is easily possible to produce many orders of magnitude more data
>>      which is costly for tools to post-process and increases the chance
>>      of data loss. This is especially relevant on larger core count
>>      systems where it is very easy to produce massive recordings.
>>      Whilst the kernel will throttle such a configuration,
>>      which helps to mitigate a large portion of the bandwidth and capture
>>      overhead, it is not something that can be controlled for on a per
>>      event basis, or for non-root users, and because throttling is
>>      controlled as a percentage of time, its effects vary from machine
>>      to machine. AIUI throttling may also produce an uneven temporal
>>      distribution of samples. Finally, whilst throttling does a good job
>>      at reducing the overall amount of data produced, it still leads to
>>      much larger captures than with this method; typically we have
>>      observed 1-2 orders of magnitude larger captures.
>>
>> This patch set modifies perf core to support alternating between two
>> sample_period values, providing a simple and inexpensive way for tools
>> to separate out the sample window (time over which events are
>> counted) from the sample period (time between interesting samples).
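>>
>> From the user's point of view that would be a single extra field in
>> perf_event_attr (sketch only; `alt_sample_period` below is an
>> illustrative name, the exact uapi field is defined by this series):
>>
>>     struct perf_event_attr attr = {
>>         .size              = sizeof(attr),
>>         .type              = PERF_TYPE_HARDWARE,
>>         .config            = PERF_COUNT_HW_CPU_CYCLES,
>>         .sample_period     = 999700,    /* long "gap" period   */
>>         .alt_sample_period = 300,       /* short sample window */
>>         .sample_type       = PERF_SAMPLE_TID | PERF_SAMPLE_READ |
>>                              PERF_SAMPLE_CALLCHAIN,
>>         .exclude_kernel    = 1,
>>     };
>>     /* fd = syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0); */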
>>
>> It is expected to be used with the cycle counter event, alternating
>> between a long and short period and subsequently discarding the counter
>> data for samples with the long period. The combined long and short
>> period gives the overall sampling period, and the short sample period
>> gives the sample window. The symbol taken from the sample at the end of
>> the long period can be used by tools to ensure correct attribution as
>> described previously. The cycle counter is recommended as it provides
>> fair temporal distribution of samples as would be required for the
>> per-symbol sample count mentioned previously, and because the PMU can
>> be programmed to overflow after a sufficiently short window (which may
>> not be possible with a software timer, for example). This patch does
>> not restrict usage to only the cycle counter; it is possible there
>> could be other novel uses based on different events, or more
>> appropriate counters on other architectures. This patch set does not
>> modify or otherwise disable the kernel's existing throttling behaviour;
>> if a configuration is given that would lead to high CPU usage, then
>> throttling still occurs.
>>
>>
>> To test this a simple `perf script` based python script was developed.
>> For a limited set of Arm PMU events it will post process a
>> `perf record`-ing and generate a table of metrics. Alongside this, a
>> benchmark application was developed that rotates through a sequence
>> of different classes of behaviour that can be detected by the Arm PMU
>> (eg. mispredicts, cache misses, different instruction mixes). The path
>> through the benchmark can be rotated after each iteration so as to
>> ensure the results don't land on some lucky harmonic with the sample
>> period. The script can be used with and without this patch allowing
>> comparison of the results. Testing was on Juno (A53+A57), N1SDP,
>> Graviton 2 and 3. In addition, this approach has been applied to a few
>> of Arm's tools projects and has correctly identified improvements and
>> regressions.
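>>
>> In outline, the benchmark's rotation looks like the following
>> (run_phase() and NR_PHASES are invented for this sketch):
>>
>>     /* Rotate the starting phase each iteration so that no phase can
>>      * sit on a fixed harmonic of the sampling period. */
>>     for (int iter = 0; iter < n_iters; iter++)
>>         for (int i = 0; i < NR_PHASES; i++)
>>             run_phase((iter + i) % NR_PHASES);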
>>
>> Headline results from testing indicate that a ~300-cycle sample window
>> gives good results with or without this patch. Typical output on N1SDP
>> (Neoverse-N1) for the provided benchmark when run as:
>>
>>      perf record -T --sample-cpu --call-graph fp,4 --user-callchains \
>>          -k CLOCK_MONOTONIC_RAW \
>>          -e '{cycles/period=999700,alt-period=300/,instructions,branch-misses,cache-references,cache-misses}:uS' \
>>          benchmark 0 1
>>
>>      perf script -s generate-function-metrics.py -- -s discard
>>
>> Looks like (reformatted for email brevity):
>>
>>      Symbol              #     CPI   BM/KI  CM/KI  %CM   %CY   %I    %BM   %L1DA  %L1DM
>>      fp_divider_stalls   6553   4.9   0.0     0.0   0.0  41.8  22.9   0.1   0.6    0.0
>>      int_divider_stalls  4741   3.5   0.0     0.0   1.1  28.3  21.5   0.1   1.9    0.2
>>      isb                 3414  20.1   0.2     0.0   0.4  17.6   2.3   0.1   0.8    0.0
>>      branch_mispredicts  1234   1.1  33.0     0.0   0.0   6.1  15.2  99.0  71.6    0.1
>>      double_to_int        694   0.5   0.0     0.0   0.6   3.4  19.1   0.1   1.2    0.1
>>      nops                 417   0.3   0.2     0.0   2.8   1.9  18.3   0.6   0.4    0.1
>>      dcache_miss          185   3.6   0.4   184.7  53.8   0.7   0.5   0.0  18.4   99.1
>>
>> (CPI = Cycles/Instruction, BM/KI = Branch Misses per 1000 Instructions,
>>   CM/KI = Cache Misses per 1000 Instructions, %CM = Percent of Cache
>>   accesses that miss, %CY = Percentage of total cycles, %I = Percentage
>>   of total instructions, %BM = Percentage of total branch mispredicts,
>>   %L1DA = Percentage of total cache accesses, %L1DM = Percentage of total
>>   cache misses)
>>
>> When the patch is used, the resulting `perf.data` files are typically
>> between 25-50x smaller than without, and take ~25x less time for the
>> python script to post-process. For example, running the following:
>>
>>      perf record -i -vvv -e '{cycles/period=1000000/,instructions}:uS' benchmark 0 1
>>      perf record -i -vvv -e '{cycles/period=1000/,instructions}:uS' benchmark 0 1
>>      perf record -i -vvv -e '{cycles/period=300/,instructions}:uS' benchmark 0 1
>>
>> produces captures on N1SDP (Neoverse-N1) of the following sizes:
>>
>>      * period=1000000: 2.601 MB perf.data (55780 samples), script time = 0m0.362s
>>      * period=1000: 283.749 MB perf.data (6162932 samples), script time = 0m33.100s
>>      * period=300: 304.281 MB perf.data (6614182 samples), script time = 0m35.826s
>>
>> The "script time" is the user time from running "time perf script -s generate-function-metrics.py"
>> on the recording. Similar processing times were observed for "time perf report --stdio|cat"
>> as well.
>>
>> By comparison, with the patch active:
>>
>>      perf record -i -vvv -e '{cycles/period=999700,alt-period=300/,instructions}:uS' benchmark 0 1
>>
>> produces 4.923 MB perf.data (107512 samples), and script time = 0m0.578s.
>> This is, as expected, ~2x the size and ~2x the number of samples of
>> the period=1000000 recording: each combined long+short period spans the
>> same 1000000 cycles but produces two overflows (one per sub-period)
>> rather than one. When compared to the period=300 recording, the results
>> from the provided post-processing script are (within margin of error)
>> the same, but the data file is ~62x smaller. The same effect is seen
>> for the post-processing script runtime.
>>
>> Notably, without the patch enabled, L1D cache miss rates are often
>> higher than with it, which we attribute to the increased impact on the
>> cache of trapping into the kernel every 300 cycles.
>>
>> These results are given with `perf_cpu_time_max_percent=25`. When tested
>> with `perf_cpu_time_max_percent=100` the size and time comparisons are
>> more significant. Disabling throttling did not lead to obvious
>> improvements in the collected metrics, suggesting that the sampling
>> approach is sufficient to collect representative metrics.
>>
>> Cursory testing on a Xeon(R) W-2145 with a 300 *instruction* sample
>> window (with and without the patch) suggests this approach might work
>> for some counters. Using the same test script, it was possible to identify
>> branch mispredicts correctly. However, whilst the patch is functionally
>> correct, differences in the architectures may mean that the approach it
>> enables does not apply as a means to collect per-function metrics on x86.
>>
>> Changes since RFC v2:
>>   - Rebased on v6.12-rc6.
>>
>> Changes since RFC v1:
>>   - Rebased on v6.9-rc1.
>>   - Refactored from an arm_pmu-based extension to a core feature.
>>   - Added the ability to jitter the sample window based on feedback
>>     from Andi Kleen.
>>   - Modified perf tool to parse the "alt-period" and "alt-period-jitter"
>>     terms in the event specification.
>>
>> Ben Gainey (4):
>>    perf: Allow periodic events to alternate between two sample periods
>>    perf: Allow adding fixed random jitter to the alternate sampling
>>      period
>>    tools/perf: Modify event parser to support alt-period term
>>    tools/perf: Modify event parser to support alt-period-jitter term
>>
>>   include/linux/perf_event.h                    |  5 ++
>>   include/uapi/linux/perf_event.h               | 13 ++++-
>>   kernel/events/core.c                          | 47 +++++++++++++++++++
>>   tools/include/uapi/linux/perf_event.h         | 13 ++++-
>>   tools/perf/tests/attr.c                       |  2 +
>>   tools/perf/tests/attr.py                      |  2 +
>>   tools/perf/tests/attr/base-record             |  4 +-
>>   tools/perf/tests/attr/base-record-spe         |  2 +
>>   tools/perf/tests/attr/base-stat               |  4 +-
>>   tools/perf/tests/attr/system-wide-dummy       |  4 +-
>>   .../attr/test-record-alt-period-jitter-term   | 13 +++++
>>   .../tests/attr/test-record-alt-period-term    | 12 +++++
>>   tools/perf/tests/attr/test-record-dummy-C0    |  4 +-
>>   tools/perf/util/parse-events.c                | 30 ++++++++++++
>>   tools/perf/util/parse-events.h                |  4 +-
>>   tools/perf/util/parse-events.l                |  2 +
>>   tools/perf/util/perf_event_attr_fprintf.c     |  1 +
>>   tools/perf/util/pmu.c                         |  2 +
>>   18 files changed, 157 insertions(+), 7 deletions(-)
>>   create mode 100644 tools/perf/tests/attr/test-record-alt-period-jitter-term
>>   create mode 100644 tools/perf/tests/attr/test-record-alt-period-term
>>
>> --
>> 2.43.0
>>
> 

