Message-ID: <e6bd86f62a3d2761d339d42140106cfd061adb87.camel@arm.com>
Date: Mon, 25 Nov 2024 17:05:07 +0000
From: Deepak Surti <Deepak.Surti@....com>
To: "irogers@...gle.com" <irogers@...gle.com>
CC: Ben Gainey <Ben.Gainey@....com>, "alexander.shishkin@...ux.intel.com"
<alexander.shishkin@...ux.intel.com>, Mark Barnett <Mark.Barnett@....com>,
James Clark <James.Clark@....com>, "adrian.hunter@...el.com"
<adrian.hunter@...el.com>, "ak@...ux.intel.com" <ak@...ux.intel.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"mingo@...hat.com" <mingo@...hat.com>, "linux-perf-users@...r.kernel.org"
<linux-perf-users@...r.kernel.org>, "will@...nel.org" <will@...nel.org>, Mark
Rutland <Mark.Rutland@....com>, "peterz@...radead.org"
<peterz@...radead.org>, "linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, "acme@...nel.org" <acme@...nel.org>,
"jolsa@...nel.org" <jolsa@...nel.org>, "namhyung@...nel.org"
<namhyung@...nel.org>
Subject: Re: [PATCH v1 0/4] A mechanism for efficient support for per-function
metrics
On Wed, 2024-11-13 at 18:22 -0800, Ian Rogers wrote:
> On Thu, Nov 7, 2024 at 8:08 AM Deepak Surti <deepak.surti@....com>
> wrote:
> >
> > This patch introduces the concept of an alternating sample rate to
> > perf core and provides the necessary basic changes in the tools to
> > activate that option.
> >
> > This patchset was originally posted by Ben Gainey as an RFC back in
> > April; the latest version can be found at
> > https://lore.kernel.org/linux-perf-users/20240422104929.264241-1-ben.gainey@arm.com/
> > Going forward, I will be owning this.
> >
> > The primary use case for this change is to enable collecting
> > per-function performance metrics using the Arm PMU, as per the
> > following approach:
> >
> >  * Starting with a simple periodic sampling (hotspot) profile,
> >    augment each sample with PMU counters accumulated over a short
> >    window up to the point the sample was taken.
> >  * For each sample, perform some filtering to improve attribution
> >    of the accumulated PMU counters (ensure they are attributed to
> >    a single function).
> >  * For each function, accumulate a total for each PMU counter so
> >    that metrics may be derived.
> >
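
For illustration only, the filtering and per-function accumulation described
in the list above amounts to something like the following rough C sketch.
The names (window_sample, func_totals, NR_COUNTERS) are hypothetical and are
not taken from the patch or from the post-processing script:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative sketch, not code from this series. */
    #define NR_COUNTERS 4  /* e.g. instructions, branch-misses, cache-refs, cache-misses */

    /* One short sample window: where it started, where it ended, and the
     * counter deltas accumulated across it. */
    struct window_sample {
        const char *start_sym;            /* leaf symbol at window start */
        const char *end_sym;              /* leaf symbol at window end */
        uint64_t counters[NR_COUNTERS];   /* deltas over the short window */
    };

    /* Running totals for a single function/symbol. */
    struct func_totals {
        const char *sym;
        uint64_t totals[NR_COUNTERS];
        uint64_t nr_windows;
    };

    /* Assumes the table is already populated with one entry per symbol
     * of interest. */
    static void accumulate(struct func_totals *table, size_t nr_funcs,
                           const struct window_sample *w)
    {
        size_t i, c;

        /* Filtering: drop windows that straddle a function boundary. */
        if (strcmp(w->start_sym, w->end_sym) != 0)
            return;

        for (i = 0; i < nr_funcs; i++) {
            if (strcmp(table[i].sym, w->end_sym) != 0)
                continue;
            for (c = 0; c < NR_COUNTERS; c++)
                table[i].totals[c] += w->counters[c];
            table[i].nr_windows++;
            return;
        }
    }

Metrics such as CPI are then derived from the per-function totals (total
cycles divided by total instructions, and so on).
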
> > Without modification, sampling at a typical rate associated with
> > hotspot profiling (~1ms) leads to poor results. Such an approach
> > gives a reasonable estimate of where the profiled application is
> > spending time for relatively low overhead, but the PMU counters
> > cannot easily be attributed to a single function as the window over
> > which they are collected is too large. A modern CPU may execute many
> > millions of instructions across many thousands of functions within a
> > 1ms window. With this approach, the per-function metrics tend to
> > trend towards some average value across the top N functions in the
> > profile.
> >
> > In order to ensure a reasonable likelihood that the counters are
> > attributed to a single function, the sampling window must be rather
> > short; typically something on the order of a few hundred cycles
> > works well, as tested on a range of aarch64 Cortex and Neoverse
> > cores.
> >
> > As it stands, it is possible to achieve this with perf using a very
> > high sampling rate (e.g. ~300cy), but there are at least three major
> > concerns with this approach:
> >
> >  * For speculatively executing, out-of-order cores, can the results
> >    be accurately attributed to a given function or the given sample
> >    window?
> >  * A short sample window is not guaranteed to cover a single
> >    function.
> >  * The overhead of sampling every few hundred cycles is very high
> >    and is highly likely to cause throttling, which is undesirable as
> >    it leads to patchy results; i.e. the profile alternates between
> >    periods of high-frequency samples followed by longer periods of
> >    no samples.
> >
> > This patch does not address the first two points directly. Some
> > means to address those are discussed in the RFC v2 cover letter. The
> > patch focuses on addressing the final point, though happily this
> > approach gives us a way to perform basic filtering on the second
> > point.
> >
> > The alternating sample period allows us to do two things:
> >
> >  * We can control the risk of throttling and reduce overhead by
> >    alternating between a long and a short period. This allows us to
> >    decouple the "periodic" sampling rate (as might be used for
> >    hotspot profiling) from the short sampling window needed for
> >    collecting the PMU counters.
> >  * The sample taken at the end of the long period can otherwise be
> >    discarded (as the PMU data is not useful), but the
> >    PERF_RECORD_CALLCHAIN information can be used to identify the
> >    current function at the start of the short sample window. This is
> >    useful for filtering samples where the PMU counter data cannot be
> >    attributed to a single function.
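
As a rough illustration of how a tool might configure such an event directly
(rather than via the perf tool's alt-period term), a sketch along the
following lines could work. The alternate-period attribute name below is a
placeholder and is left commented out, since the exact uapi field added by
this series is not shown here; everything else uses the standard
perf_event_open() interface:

    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Open a cycles event with a long base period; the short alternate
     * window would be set via the new attribute added by this series
     * (field name below is a placeholder, hence commented out). */
    static int open_alternating_cycles(pid_t pid)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 999700;          /* long period */
        /* attr.alt_sample_period = 300; */   /* short window, placeholder name */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                           PERF_SAMPLE_PERIOD | PERF_SAMPLE_CALLCHAIN;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
    }

A tool would then read PERF_SAMPLE_PERIOD from each record to tell the
long-period samples (callchain only) apart from the short-window samples
(counter data), as described above.
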
>
>
Hi Ian,
> I think this is interesting. I'm a little concerned about the
> approach, as I wonder if a more flexible mechanism could be had.
> 
> One approach that wouldn't work would be to open high- and
> low-frequency events, or groups of events, then use BPF filters to
> try to replicate this approach by dropping most of the high-frequency
> events. I don't think it would work, as the high-frequency sampling
> is likely going to trigger during the BPF filter execution, and the
> BPF filter would be too much overhead.
>
> Perhaps another approach is to change the perf event period with a
> new BPF helper function that's called where we do the perf event
> filtering. There's the overhead of running the BPF code, but the BPF
> code could let you alternate between an arbitrary number of periods
> rather than just two.
As you note, just using BPF to filter high-frequency samples is
probably too much overhead.

More generally, though, the issue with BPF is that it would impose
additional permission restrictions compared with perf alone. That is
to say, in many cases BPF is restricted to root only, whereas a
perf-only approach supports application profiling for non-root users.
For our use case in particular, this is more important than the
theoretical additional flexibility that BPF could bring.
Thanks,
Deepak
> Thanks,
> Ian
>
> > There are several reasons why it is desirable to reduce the overhead
> > and risk of throttling:
> >
> >   * PMU counter overflow typically causes an interrupt into the
> >     kernel; this affects program runtime, and can affect things like
> >     branch prediction, cache locality and so on, which can skew the
> >     metrics.
> >   * The very high sample rate produces significant amounts of data.
> >     Depending on the configuration of the profiling session and
> >     machine, it is easily possible to produce many orders of
> >     magnitude more data, which is costly for tools to post-process
> >     and increases the chance of data loss. This is especially
> >     relevant on larger core-count systems where it is very easy to
> >     produce massive recordings. Whilst the kernel will throttle such
> >     a configuration, which helps to mitigate a large portion of the
> >     bandwidth and capture overhead, it is not something that can be
> >     controlled on a per-event basis, or for non-root users, and
> >     because throttling is controlled as a percentage of time, its
> >     effects vary from machine to machine. AIUI throttling may also
> >     produce an uneven temporal distribution of samples. Finally,
> >     whilst throttling does a good job of reducing the overall amount
> >     of data produced, it still leads to much larger captures than
> >     with this method; typically we have observed captures 1-2 orders
> >     of magnitude larger.
> >
> > This patch set modifies perf core to support alternating between two
> > sample_period values, providing a simple and inexpensive way for
> > tools to separate out the sample window (time over which events are
> > counted) from the sample period (time between interesting samples).
> >
> > It is expected to be used with the cycle counter event, alternating
> > between a long and a short period and subsequently discarding the
> > counter data for samples with the long period. The combined long and
> > short period gives the overall sampling period, and the short sample
> > period gives the sample window. The symbol taken from the sample at
> > the end of the long period can be used by tools to ensure correct
> > attribution as described previously. The cycle counter is
> > recommended as it provides fair temporal distribution of samples, as
> > would be required for the per-symbol sample count mentioned
> > previously, and because the PMU can be programmed to overflow after
> > a sufficiently short window (which may not be possible with a
> > software timer, for example). This patch does not restrict usage to
> > the cycle counter; there may be other novel uses based on different
> > events, or more appropriate counters on other architectures. This
> > patch set does not modify or otherwise disable the kernel's existing
> > throttling behaviour; if a configuration is given that would lead to
> > high CPU usage, then throttling still occurs.
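
Conceptually, the core change amounts to flipping the next overflow period
between the two configured values, roughly along these lines. This is an
illustrative sketch only, with made-up names (alt_sample_period,
using_alt_period), not the code from the series:

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal stand-in for the relevant perf event state. */
    struct event_period_state {
        uint64_t sample_period;       /* long period, e.g. 999700 */
        uint64_t alt_sample_period;   /* short window, e.g. 300; 0 = off */
        bool using_alt_period;
    };

    /* Called when programming the next overflow: alternate between the
     * long period and the short window so that every other sample covers
     * only the short window. */
    static uint64_t next_sample_period(struct event_period_state *ev)
    {
        if (!ev->alt_sample_period)
            return ev->sample_period;   /* feature unused: behave as today */

        ev->using_alt_period = !ev->using_alt_period;
        return ev->using_alt_period ? ev->alt_sample_period
                                    : ev->sample_period;
    }

The jitter variant mentioned in the changelog below would presumably add a
small random offset to the short window each time; that detail is omitted
here.
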
> >
> >
> > To test this, a simple `perf script`-based Python script was
> > developed. For a limited set of Arm PMU events it will post-process
> > a `perf record`-ing and generate a table of metrics. Alongside this,
> > a benchmark application was developed that rotates through a
> > sequence of different classes of behaviour that can be detected by
> > the Arm PMU (e.g. mispredicts, cache misses, different instruction
> > mixes). The path through the benchmark can be rotated after each
> > iteration so as to ensure the results don't land on some lucky
> > harmonic with the sample period. The script can be used with and
> > without this patch, allowing comparison of the results. Testing was
> > on Juno (A53+A57), N1SDP, Graviton 2 and 3. In addition, this
> > approach has been applied to a few of Arm's tools projects and has
> > correctly identified improvements and regressions.
> >
> > Headline results from testing indicate that a ~300 cycle sample
> > window gives good results with or without this patch. Typical output
> > on N1SDP (Neoverse-N1) for the provided benchmark when run as:
> >
> >     perf record -T --sample-cpu --call-graph fp,4 --user-callchains \
> >         -k CLOCK_MONOTONIC_RAW \
> >         -e '{cycles/period=999700,alt-period=300/,instructions,branch-misses,cache-references,cache-misses}:uS' \
> >         benchmark 0 1
> >
> >     perf script -s generate-function-metrics.py -- -s discard
> >
> > Looks like (reformatted for email brevity):
> >
> >     Symbol                 #   CPI  BM/KI  CM/KI   %CM   %CY    %I   %BM  %L1DA  %L1DM
> >     fp_divider_stalls   6553   4.9    0.0    0.0   0.0  41.8  22.9   0.1    0.6    0.0
> >     int_divider_stalls  4741   3.5    0.0    0.0   1.1  28.3  21.5   0.1    1.9    0.2
> >     isb                 3414  20.1    0.2    0.0   0.4  17.6   2.3   0.1    0.8    0.0
> >     branch_mispredicts  1234   1.1   33.0    0.0   0.0   6.1  15.2  99.0   71.6    0.1
> >     double_to_int        694   0.5    0.0    0.0   0.6   3.4  19.1   0.1    1.2    0.1
> >     nops                 417   0.3    0.2    0.0   2.8   1.9  18.3   0.6    0.4    0.1
> >     dcache_miss          185   3.6    0.4  184.7  53.8   0.7   0.5   0.0   18.4   99.1
> >
> > (CPI = Cycles per Instruction, BM/KI = Branch Misses per 1000
> > Instructions, CM/KI = Cache Misses per 1000 Instructions, %CM =
> > Percentage of cache accesses that miss, %CY = Percentage of total
> > cycles, %I = Percentage of total instructions, %BM = Percentage of
> > total branch mispredicts, %L1DA = Percentage of total cache
> > accesses, %L1DM = Percentage of total cache misses)
> >
> > When the patch is used, the resulting `perf.data` files are
> > typically 25-50x smaller than without, and take ~25x less time for
> > the Python script to post-process. For example, running the
> > following:
> >
> >     perf record -i -vvv -e '{cycles/period=1000000/,instructions}:uS' benchmark 0 1
> >     perf record -i -vvv -e '{cycles/period=1000/,instructions}:uS' benchmark 0 1
> >     perf record -i -vvv -e '{cycles/period=300/,instructions}:uS' benchmark 0 1
> >
> > produces captures on N1SDP (Neoverse-N1) of the following sizes:
> >
> >   * period=1000000: 2.601 MB perf.data (55780 samples), script time = 0m0.362s
> >   * period=1000: 283.749 MB perf.data (6162932 samples), script time = 0m33.100s
> >   * period=300: 304.281 MB perf.data (6614182 samples), script time = 0m35.826s
> >
> > The "script time" is the user time from running "time perf script -
> > s generate-function-metrics.py"
> > on the recording. Similar processing times were observed for "time
> > perf report --stdio|cat"
> > as well.
> >
> > By comparison, with the patch active:
> >
> >     perf record -i -vvv -e '{cycles/period=999700,alt-period=300/,instructions}:uS' benchmark 0 1
> >
> > produces 4.923 MB perf.data (107512 samples), and script time =
> > 0m0.578s. This is, as expected, ~2x the size and ~2x the number of
> > samples of the period=1000000 recording, since each combined
> > 999700+300 cycle period now produces two samples rather than one.
> > When compared to the period=300 recording, the results from the
> > provided post-processing script are (within margin of error) the
> > same, but the data file is ~62x smaller. The same effect is seen for
> > the post-processing script runtime.
> >
> > Notably, without the patch enabled, L1D cache miss rates are often
> > higher than with it, which we attribute to the increased impact on
> > the cache of trapping into the kernel every 300 cycles.
> >
> > These results are given with `perf_cpu_time_max_percent=25`. When
> > tested with `perf_cpu_time_max_percent=100` the size and time
> > comparisons are more significant. Disabling throttling did not lead
> > to obvious improvements in the collected metrics, suggesting that
> > the sampling approach is sufficient to collect representative
> > metrics.
> >
> > Cursory testing on a Xeon(R) W-2145 with a 300 *instruction* sample
> > window (with and without the patch) suggests this approach might
> > work for some counters. Using the same test script, it was possible
> > to identify branch mispredicts correctly. However, whilst the patch
> > is functionally correct, differences in the architectures may mean
> > that the approach it enables does not carry over as a means to
> > collect per-function metrics on x86.
> >
> > Changes since RFC v2:
> > - Rebased on v6.12-rc6.
> >
> > Changes since RFC v1:
> >   - Rebased on v6.9-rc1.
> >   - Refactored from an arm_pmu-based extension to a core feature.
> >   - Added the ability to jitter the sample window, based on feedback
> >     from Andi Kleen.
> >   - Modified perf tool to parse the "alt-period" and
> >     "alt-period-jitter" terms in the event specification.
> >
> > Ben Gainey (4):
> >   perf: Allow periodic events to alternate between two sample periods
> >   perf: Allow adding fixed random jitter to the alternate sampling period
> > tools/perf: Modify event parser to support alt-period term
> > tools/perf: Modify event parser to support alt-period-jitter term
> >
> > include/linux/perf_event.h | 5 ++
> > include/uapi/linux/perf_event.h | 13 ++++-
> >  kernel/events/core.c                          | 47 +++++++++++++++++++
> > tools/include/uapi/linux/perf_event.h | 13 ++++-
> > tools/perf/tests/attr.c | 2 +
> > tools/perf/tests/attr.py | 2 +
> > tools/perf/tests/attr/base-record | 4 +-
> > tools/perf/tests/attr/base-record-spe | 2 +
> > tools/perf/tests/attr/base-stat | 4 +-
> > tools/perf/tests/attr/system-wide-dummy | 4 +-
> > .../attr/test-record-alt-period-jitter-term | 13 +++++
> > .../tests/attr/test-record-alt-period-term | 12 +++++
> > tools/perf/tests/attr/test-record-dummy-C0 | 4 +-
> > tools/perf/util/parse-events.c | 30 ++++++++++++
> > tools/perf/util/parse-events.h | 4 +-
> > tools/perf/util/parse-events.l | 2 +
> > tools/perf/util/perf_event_attr_fprintf.c | 1 +
> > tools/perf/util/pmu.c | 2 +
> > 18 files changed, 157 insertions(+), 7 deletions(-)
> >  create mode 100644 tools/perf/tests/attr/test-record-alt-period-jitter-term
> >  create mode 100644 tools/perf/tests/attr/test-record-alt-period-term
> >
> > --
> > 2.43.0
> >