Message-ID: <CACT4Y+awFXT2j+HMeAy2RKnoBzb--+heFzJUoBZWp9iJevy1Dw@mail.gmail.com>
Date: Fri, 7 Feb 2025 09:16:28 +0100
From: Dmitry Vyukov <dvyukov@...gle.com>
To: Andi Kleen <ak@...ux.intel.com>
Cc: namhyung@...nel.org, irogers@...gle.com, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, Arnaldo Carvalho de Melo <acme@...nel.org>
Subject: Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@...ux.intel.com> wrote:
>
> Dmitry Vyukov <dvyukov@...gle.com> writes:
>
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. Each second of wall-clock time we have
> > number-of-cores seconds of CPU time.
>
> I'm curious how does this interact with the time / --time-quantum sort key?
>
> I assume it just works, but might be worth checking.
Yes, it seems to just work as one would assume. Things just combine as intended.
> It was intended to address some of these issues too.
What issue? Latency profiling? I wonder what approach you had in mind?
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where latency profile is needed:
> > - optimizing build latency
> > - optimizing server request latency
> > - optimizing ML training/inference latency
> > - optimizing running time of any command line program
> >
> > CPU profile is useless for these use cases at best (if a user understands
> > the difference), or misleading at worst (if a user tries to use a wrong
> > profile for a job).
>
> I would agree in the general case, but not if the time sort key
> is chosen with a suitable quantum. You can see how the parallelism
> changes over time then which is often a good enough proxy.
That's an interesting feature, but I don't see how it helps with
latency profiling. How do you infer parallelism for slices? It looks
like it just gives the same wrong CPU profile, but multiple times
(once per slice).
Also, (1) the user still needs to understand that the default profile
is wrong, (2) they need to be proficient with perf features, (3) they
have to manually aggregate lots of data (time slicing multiplies the
amount of data in the profile), and (4) they have to deal with
inaccuracy caused by edge effects (e.g. the slice is 1s, but the
program phase changed mid-second).
But it does open some interesting capabilities in combination with a
latency profile, e.g. the following shows how parallelism was changing
over time.
For a profile of a perf build (make):
perf report -F time,latency,parallelism --time-quantum=1s
# Time Latency Parallelism
# ............ ........ ...........
#
1795957.0000 1.42% 1
1795957.0000 0.07% 2
1795957.0000 0.01% 3
1795957.0000 0.00% 4
1795958.0000 4.82% 1
1795958.0000 0.11% 2
1795958.0000 0.00% 3
...
1795964.0000 1.76% 2
1795964.0000 0.58% 4
1795964.0000 0.45% 1
1795964.0000 0.23% 10
1795964.0000 0.21% 6
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
Here it finally started running on more than 1 CPU.
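To make the difference concrete, here is a minimal sketch (not perf's
internal API; the function and symbol names are made up for
illustration) of the weighting idea behind the latency column: each
sample is weighted by 1/parallelism, so a serial phase can dominate
latency even when a parallel phase burns far more total CPU time.

```python
from collections import defaultdict

def overheads(samples):
    """samples: list of (symbol, parallelism) pairs, one per CPU sample.

    Returns (cpu_profile, latency_profile), each mapping symbol to its
    fraction of the total."""
    cpu = defaultdict(float)
    latency = defaultdict(float)
    for sym, par in samples:
        cpu[sym] += 1.0            # CPU profile: every sample counts equally
        latency[sym] += 1.0 / par  # latency profile: weighted by 1/parallelism
    total_cpu = sum(cpu.values())
    total_lat = sum(latency.values())
    return ({s: v / total_cpu for s, v in cpu.items()},
            {s: v / total_lat for s, v in latency.items()})

# A serial phase (parallelism 1) next to a phase running on 8 CPUs:
samples = [("serial_link", 1)] * 4 + [("parallel_compile", 8)] * 16
cpu_prof, lat_prof = overheads(samples)
# CPU profile says parallel_compile dominates (80%); the latency
# profile says serial_link dominates (~67%) because its samples each
# cover a full second of wall-clock time.
```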
> > We still default to the CPU profile, so it's up to users to learn
> > about the second profiling mode and use it when appropriate.
>
> You should add it to tips.txt then
>
> > .../callchain-overhead-calculation.txt | 5 +-
> > .../cpu-and-latency-overheads.txt | 85 ++++++++++++++
> > tools/perf/Documentation/perf-record.txt | 4 +
> > tools/perf/Documentation/perf-report.txt | 54 ++++++---
> > tools/perf/Documentation/tips.txt | 3 +
> > tools/perf/builtin-record.c | 20 ++++
> > tools/perf/builtin-report.c | 39 +++++++
> > tools/perf/ui/browsers/hists.c | 27 +++--
> > tools/perf/ui/hist.c | 104 ++++++++++++------
> > tools/perf/util/addr_location.c | 1 +
> > tools/perf/util/addr_location.h | 7 +-
> > tools/perf/util/event.c | 11 ++
> > tools/perf/util/events_stats.h | 2 +
> > tools/perf/util/hist.c | 90 ++++++++++++---
> > tools/perf/util/hist.h | 32 +++++-
> > tools/perf/util/machine.c | 7 ++
> > tools/perf/util/machine.h | 6 +
> > tools/perf/util/sample.h | 2 +-
> > tools/perf/util/session.c | 12 ++
> > tools/perf/util/session.h | 1 +
> > tools/perf/util/sort.c | 69 +++++++++++-
> > tools/perf/util/sort.h | 3 +-
> > tools/perf/util/symbol.c | 34 ++++++
> > tools/perf/util/symbol_conf.h | 8 +-
>
> We traditionally didn't do it, but in general test coverage
> of perf report is too low, so I would recommend to add some simple
> test case in the perf test scripts.
>
> -Andi