Message-ID: <e6bd86f62a3d2761d339d42140106cfd061adb87.camel@arm.com>
Date: Mon, 25 Nov 2024 17:05:07 +0000
From: Deepak Surti <Deepak.Surti@....com>
To: "irogers@...gle.com" <irogers@...gle.com>
CC: Ben Gainey <Ben.Gainey@....com>, "alexander.shishkin@...ux.intel.com"
<alexander.shishkin@...ux.intel.com>, Mark Barnett <Mark.Barnett@....com>,
James Clark <James.Clark@....com>, "adrian.hunter@...el.com"
<adrian.hunter@...el.com>, "ak@...ux.intel.com" <ak@...ux.intel.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"mingo@...hat.com" <mingo@...hat.com>, "linux-perf-users@...r.kernel.org"
<linux-perf-users@...r.kernel.org>, "will@...nel.org" <will@...nel.org>, Mark
Rutland <Mark.Rutland@....com>, "peterz@...radead.org"
<peterz@...radead.org>, "linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, "acme@...nel.org" <acme@...nel.org>,
"jolsa@...nel.org" <jolsa@...nel.org>, "namhyung@...nel.org"
<namhyung@...nel.org>
Subject: Re: [PATCH v1 0/4] A mechanism for efficient support for per-function
metrics
On Wed, 2024-11-13 at 18:22 -0800, Ian Rogers wrote:
> On Thu, Nov 7, 2024 at 8:08 AM Deepak Surti <deepak.surti@....com>
> wrote:
> >
> > This patch introduces the concept of an alternating sample rate to
> > perf core and provides the necessary basic changes in the tools to
> > activate that option.
> >
> > This patchset was originally posted by Ben Gainey as an RFC back in
> > April; the latest version can be found at
> > https://lore.kernel.org/linux-perf-users/20240422104929.264241-1-ben.gainey@arm.com/
> > Going forward, I will be owning this.
> >
> > The primary use case for this change is to enable collecting
> > per-function performance metrics using the Arm PMU, as per the
> > following approach:
> >
> >  * Starting with a simple periodic sampling (hotspot) profile,
> >    augment each sample with PMU counters accumulated over a short
> >    window up to the point the sample was taken.
> >  * For each sample, perform some filtering to improve attribution
> >    of the accumulated PMU counters (ensure they are attributed to
> >    a single function).
> >  * For each function, accumulate a total for each PMU counter so
> >    that metrics may be derived.
> >
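
For illustration only, the filtering and per-function accumulation described
in the list above amounts to something like the following rough C sketch.
The names (window_sample, func_totals, NR_COUNTERS) are hypothetical and are
not taken from the patch or from the post-processing script:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative sketch, not code from this series. */
    #define NR_COUNTERS 4  /* e.g. instructions, branch-misses, cache-refs, cache-misses */

    /* One short sample window: where it started, where it ended, and the
     * counter deltas accumulated across it. */
    struct window_sample {
        const char *start_sym;            /* leaf symbol at window start */
        const char *end_sym;              /* leaf symbol at window end */
        uint64_t counters[NR_COUNTERS];   /* deltas over the short window */
    };

    /* Running totals for a single function/symbol. */
    struct func_totals {
        const char *sym;
        uint64_t totals[NR_COUNTERS];
        uint64_t nr_windows;
    };

    /* Assumes the table is already populated with one entry per symbol
     * of interest. */
    static void accumulate(struct func_totals *table, size_t nr_funcs,
                           const struct window_sample *w)
    {
        size_t i, c;

        /* Filtering: drop windows that straddle a function boundary. */
        if (strcmp(w->start_sym, w->end_sym) != 0)
            return;

        for (i = 0; i < nr_funcs; i++) {
            if (strcmp(table[i].sym, w->end_sym) != 0)
                continue;
            for (c = 0; c < NR_COUNTERS; c++)
                table[i].totals[c] += w->counters[c];
            table[i].nr_windows++;
            return;
        }
    }

Metrics such as CPI are then derived from the per-function totals (total
cycles divided by total instructions, and so on).
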
> > Without modification, sampling at a typical rate associated with
> > hotspot profiling (~1ms) leads to poor results. Such an approach
> > gives a reasonable estimate of where the profiled application is
> > spending time for relatively low overhead, but the PMU counters
> > cannot easily be attributed to a single function as the window over
> > which they are collected is too large. A modern CPU may execute many
> > millions of instructions across many thousands of functions within a
> > 1ms window. With this approach, the per-function metrics tend to
> > trend towards some average value across the top N functions in the
> > profile.
> >
> > In order to ensure a reasonable likelihood that the counters are
> > attributed to a single function, the sampling window must be rather
> > short; typically something on the order of a few hundred cycles
> > works well, as tested on a range of aarch64 Cortex and Neoverse
> > cores.
> >
> > As it stands, it is possible to achieve this with perf using a very
> > high sampling rate (e.g. ~300cy), but there are at least three major
> > concerns with this approach:
> >
> >  * For speculatively executing, out-of-order cores, can the results
> >    be accurately attributed to a given function or the given sample
> >    window?
> >  * A short sample window is not guaranteed to cover a single
> >    function.
> >  * The overhead of sampling every few hundred cycles is very high
> >    and is highly likely to cause throttling, which is undesirable as
> >    it leads to patchy results; i.e. the profile alternates between
> >    periods of high-frequency samples followed by longer periods of
> >    no samples.
> >
> > This patch does not address the first two points directly. Some
> > means to address those are discussed in the RFC v2 cover letter. The
> > patch focuses on addressing the final point, though happily this
> > approach gives us a way to perform basic filtering on the second
> > point.
> >
> > The alternating sample period allows us to do two things:
> >
> >  * We can control the risk of throttling and reduce overhead by
> >    alternating between a long and a short period. This allows us to
> >    decouple the "periodic" sampling rate (as might be used for
> >    hotspot profiling) from the short sampling window needed for
> >    collecting the PMU counters.
> >  * The sample taken at the end of the long period can otherwise be
> >    discarded (as the PMU data is not useful), but the
> >    PERF_RECORD_CALLCHAIN information can be used to identify the
> >    current function at the start of the short sample window. This is
> >    useful for filtering samples where the PMU counter data cannot be
> >    attributed to a single function.
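
As a rough illustration of how a tool might configure such an event directly
(rather than via the perf tool's alt-period term), a sketch along the
following lines could work. The alternate-period attribute name below is a
placeholder and is left commented out, since the exact uapi field added by
this series is not shown here; everything else uses the standard
perf_event_open() interface:

    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Open a cycles event with a long base period; the short alternate
     * window would be set via the new attribute added by this series
     * (field name below is a placeholder, hence commented out). */
    static int open_alternating_cycles(pid_t pid)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 999700;          /* long period */
        /* attr.alt_sample_period = 300; */   /* short window, placeholder name */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                           PERF_SAMPLE_PERIOD | PERF_SAMPLE_CALLCHAIN;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
    }

A tool would then read PERF_SAMPLE_PERIOD from each record to tell the
long-period samples (callchain only) apart from the short-window samples
(counter data), as described above.
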
>
>
Hi Ian,
> I think this is interesting. I'm a little concerned about the
> approach, as I wonder if a more flexible mechanism could be had.
> 
> One approach that wouldn't work would be to open high- and
> low-frequency events, or groups of events, then use BPF filters to
> try to replicate this approach by dropping most of the high-frequency
> events. I don't think it would work, as the high-frequency sampling
> is likely going to trigger during the BPF filter execution, and the
> BPF filter would be too much overhead.
>
> Perhaps another approach is to change the perf event period with a
> new BPF helper function that's called where we do the perf event
> filtering. There's the overhead of running the BPF code, but the BPF
> code could let you alternate between an arbitrary number of periods
> rather than just two.
As you note, just using BPF to filter high-frequency samples is
probably too much overhead.

More generally, though, the issue with BPF is that it would impose
additional permission restrictions compared with perf alone. That is
to say, in many cases BPF is restricted to root only, whereas a
perf-only approach supports application profiling for non-root users.
For our use case in particular, this is more important than the
theoretical additional flexibility that BPF could bring.
Thanks,
Deepak
> Thanks,
> Ian
>
> > There are several reasons why it is desirable to reduce the overhead
> > and risk of throttling:
> >
> >   * PMU counter overflow typically causes an interrupt into the
> >     kernel; this affects program runtime, and can affect things like
> >     branch prediction, cache locality and so on, which can skew the
> >     metrics.
> >   * The very high sample rate produces significant amounts of data.
> >     Depending on the configuration of the profiling session and
> >     machine, it is easily possible to produce many orders of
> >     magnitude more data, which is costly for tools to post-process
> >     and increases the chance of data loss. This is especially
> >     relevant on larger core-count systems where it is very easy to
> >     produce massive recordings. Whilst the kernel will throttle such
> >     a configuration, which helps to mitigate a large portion of the
> >     bandwidth and capture overhead, it is not something that can be
> >     controlled on a per-event basis, or for non-root users, and
> >     because throttling is controlled as a percentage of time, its
> >     effects vary from machine to machine. AIUI throttling may also
> >     produce an uneven temporal distribution of samples. Finally,
> >     whilst throttling does a good job of reducing the overall amount
> >     of data produced, it still leads to much larger captures than
> >     with this method; typically we have observed captures 1-2 orders
> >     of magnitude larger.
> >
> > This patch set modifies perf core to support alternating between two
> > sample_period values, providing a simple and inexpensive way for
> > tools to separate out the sample window (time over which events are
> > counted) from the sample period (time between interesting samples).
> >
> > It is expected to be used with the cycle counter event, alternating
> > between a long and a short period and subsequently discarding the
> > counter data for samples with the long period. The combined long and
> > short period gives the overall sampling period, and the short sample
> > period gives the sample window. The symbol taken from the sample at
> > the end of the long period can be used by tools to ensure correct
> > attribution as described previously. The cycle counter is
> > recommended as it provides fair temporal distribution of samples, as
> > would be required for the per-symbol sample count mentioned
> > previously, and because the PMU can be programmed to overflow after
> > a sufficiently short window (which may not be possible with a
> > software timer, for example). This patch does not restrict usage to
> > the cycle counter; there may be other novel uses based on different
> > events, or more appropriate counters on other architectures. This
> > patch set does not modify or otherwise disable the kernel's existing
> > throttling behaviour; if a configuration is given that would lead to
> > high CPU usage, then throttling still occurs.
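
Conceptually, the core change amounts to flipping the next overflow period
between the two configured values, roughly along these lines. This is an
illustrative sketch only, with made-up names (alt_sample_period,
using_alt_period), not the code from the series:

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal stand-in for the relevant perf event state. */
    struct event_period_state {
        uint64_t sample_period;       /* long period, e.g. 999700 */
        uint64_t alt_sample_period;   /* short window, e.g. 300; 0 = off */
        bool using_alt_period;
    };

    /* Called when programming the next overflow: alternate between the
     * long period and the short window so that every other sample covers
     * only the short window. */
    static uint64_t next_sample_period(struct event_period_state *ev)
    {
        if (!ev->alt_sample_period)
            return ev->sample_period;   /* feature unused: behave as today */

        ev->using_alt_period = !ev->using_alt_period;
        return ev->using_alt_period ? ev->alt_sample_period
                                    : ev->sample_period;
    }

The jitter variant mentioned in the changelog below would presumably add a
small random offset to the short window each time; that detail is omitted
here.
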
> >
> >
> > To test this, a simple `perf script`-based Python script was
> > developed. For a limited set of Arm PMU events it will post-process
> > a `perf record`-ing and generate a table of metrics. Alongside this,
> > a benchmark application was developed that rotates through a
> > sequence of different classes of behaviour that can be detected by
> > the Arm PMU (e.g. mispredicts, cache misses, different instruction
> > mixes). The path through the benchmark can be rotated after each
> > iteration so as to ensure the results don't land on some lucky
> > harmonic with the sample period. The script can be used with and
> > without this patch, allowing comparison of the results. Testing was
> > on Juno (A53+A57), N1SDP, Graviton 2 and 3. In addition, this
> > approach has been applied to a few of Arm's tools projects and has
> > correctly identified improvements and regressions.
> >
> > Headline results from testing indicate that a ~300 cycle sample
> > window gives good results with or without this patch. Typical output
> > on N1SDP (Neoverse-N1) for the provided benchmark when run as:
> >
> >     perf record -T --sample-cpu --call-graph fp,4 --user-callchains \
> >         -k CLOCK_MONOTONIC_RAW \
> >         -e '{cycles/period=999700,alt-period=300/,instructions,branch-misses,cache-references,cache-misses}:uS' \
> >         benchmark 0 1
> >
> >     perf script -s generate-function-metrics.py -- -s discard
> >
> > Looks like (reformatted for email brevity):
> >
> >     Symbol                 #   CPI  BM/KI  CM/KI   %CM   %CY    %I   %BM  %L1DA  %L1DM
> >     fp_divider_stalls   6553   4.9    0.0    0.0   0.0  41.8  22.9   0.1    0.6    0.0
> >     int_divider_stalls  4741   3.5    0.0    0.0   1.1  28.3  21.5   0.1    1.9    0.2
> >     isb                 3414  20.1    0.2    0.0   0.4  17.6   2.3   0.1    0.8    0.0
> >     branch_mispredicts  1234   1.1   33.0    0.0   0.0   6.1  15.2  99.0   71.6    0.1
> >     double_to_int        694   0.5    0.0    0.0   0.6   3.4  19.1   0.1    1.2    0.1
> >     nops                 417   0.3    0.2    0.0   2.8   1.9  18.3   0.6    0.4    0.1
> >     dcache_miss          185   3.6    0.4  184.7  53.8   0.7   0.5   0.0   18.4   99.1
> >
> > (CPI = Cycles per Instruction, BM/KI = Branch Misses per 1000
> > Instructions, CM/KI = Cache Misses per 1000 Instructions, %CM =
> > Percentage of cache accesses that miss, %CY = Percentage of total
> > cycles, %I = Percentage of total instructions, %BM = Percentage of
> > total branch mispredicts, %L1DA = Percentage of total cache
> > accesses, %L1DM = Percentage of total cache misses)
> >
> > When the patch is used, the resulting `perf.data` files are
> > typically 25-50x smaller than without, and take ~25x less time for
> > the Python script to post-process. For example, running the
> > following:
> >
> >     perf record -i -vvv -e '{cycles/period=1000000/,instructions}:uS' benchmark 0 1
> >     perf record -i -vvv -e '{cycles/period=1000/,instructions}:uS' benchmark 0 1
> >     perf record -i -vvv -e '{cycles/period=300/,instructions}:uS' benchmark 0 1
> >
> > produces captures on N1SDP (Neoverse-N1) of the following sizes:
> >
> >   * period=1000000: 2.601 MB perf.data (55780 samples), script time = 0m0.362s
> >   * period=1000: 283.749 MB perf.data (6162932 samples), script time = 0m33.100s
> >   * period=300: 304.281 MB perf.data (6614182 samples), script time = 0m35.826s
> >
> > The "script time" is the user time from running "time perf script -
> > s generate-function-metrics.py"
> > on the recording. Similar processing times were observed for "time
> > perf report --stdio|cat"
> > as well.
> >
> > By comparison, with the patch active:
> >
> >     perf record -i -vvv -e '{cycles/period=999700,alt-period=300/,instructions}:uS' benchmark 0 1
> >
> > produces 4.923 MB perf.data (107512 samples), and script time =
> > 0m0.578s. This is, as expected, ~2x the size and ~2x the number of
> > samples of the period=1000000 recording, since each combined
> > 999700+300 cycle period now produces two samples rather than one.
> > When compared to the period=300 recording, the results from the
> > provided post-processing script are (within margin of error) the
> > same, but the data file is ~62x smaller. The same effect is seen for
> > the post-processing script runtime.
> >
> > Notably, without the patch enabled, L1D cache miss rates are often
> > higher than with it, which we attribute to the increased impact on
> > the cache of trapping into the kernel every 300 cycles.
> >
> > These results are given with `perf_cpu_time_max_percent=25`. When
> > tested with `perf_cpu_time_max_percent=100` the size and time
> > comparisons are more significant. Disabling throttling did not lead
> > to obvious improvements in the collected metrics, suggesting that
> > the sampling approach is sufficient to collect representative
> > metrics.
> >
> > Cursory testing on a Xeon(R) W-2145 with a 300 *instruction* sample
> > window (with and without the patch) suggests this approach might
> > work for some counters. Using the same test script, it was possible
> > to identify branch mispredicts correctly. However, whilst the patch
> > is functionally correct, differences in the architectures may mean
> > that the approach it enables does not carry over as a means to
> > collect per-function metrics on x86.
> >
> > Changes since RFC v2:
> > - Rebased on v6.12-rc6.
> >
> > Changes since RFC v1:
> >   - Rebased on v6.9-rc1.
> >   - Refactored from an arm_pmu-based extension to a core feature.
> >   - Added the ability to jitter the sample window, based on feedback
> >     from Andi Kleen.
> >   - Modified perf tool to parse the "alt-period" and
> >     "alt-period-jitter" terms in the event specification.
> >
> > Ben Gainey (4):
> >   perf: Allow periodic events to alternate between two sample periods
> >   perf: Allow adding fixed random jitter to the alternate sampling period
> > tools/perf: Modify event parser to support alt-period term
> > tools/perf: Modify event parser to support alt-period-jitter term
> >
> > include/linux/perf_event.h | 5 ++
> > include/uapi/linux/perf_event.h | 13 ++++-
> >  kernel/events/core.c                          | 47 +++++++++++++++++++
> > tools/include/uapi/linux/perf_event.h | 13 ++++-
> > tools/perf/tests/attr.c | 2 +
> > tools/perf/tests/attr.py | 2 +
> > tools/perf/tests/attr/base-record | 4 +-
> > tools/perf/tests/attr/base-record-spe | 2 +
> > tools/perf/tests/attr/base-stat | 4 +-
> > tools/perf/tests/attr/system-wide-dummy | 4 +-
> > .../attr/test-record-alt-period-jitter-term | 13 +++++
> > .../tests/attr/test-record-alt-period-term | 12 +++++
> > tools/perf/tests/attr/test-record-dummy-C0 | 4 +-
> > tools/perf/util/parse-events.c | 30 ++++++++++++
> > tools/perf/util/parse-events.h | 4 +-
> > tools/perf/util/parse-events.l | 2 +
> > tools/perf/util/perf_event_attr_fprintf.c | 1 +
> > tools/perf/util/pmu.c | 2 +
> > 18 files changed, 157 insertions(+), 7 deletions(-)
> >  create mode 100644 tools/perf/tests/attr/test-record-alt-period-jitter-term
> >  create mode 100644 tools/perf/tests/attr/test-record-alt-period-term
> >
> > --
> > 2.43.0
> >