[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACT4Y+b9B6=ijtGfrtwQFq44ceiRTmv53pC9iqgNsbeg=cFr4Q@mail.gmail.com>
Date: Wed, 5 Feb 2025 17:37:25 +0100
From: Dmitry Vyukov <dvyukov@...gle.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: irogers@...gle.com, acme@...nel.org, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 0/8] perf report: Add latency and parallelism profiling
On Tue, 4 Feb 2025 at 05:21, Namhyung Kim <namhyung@...nel.org> wrote:
>
> Hello,
>
> On Mon, Feb 03, 2025 at 03:30:35PM +0100, Dmitry Vyukov wrote:
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. Each second of wall-clock time we have
> > number-of-cores seconds of CPU time.
> >
> > Currently perf only allows to profile CPU time. Perf (and all other
> > existing profilers to the be best of my knowledge) does not allow to
> > profile wall-clock time.
> >
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where latency profile is needed:
> > - optimzing build latency
> > - optimizing server request latency
> > - optimizing ML training/inference latency
> > - optimizing running time of any command line program
> >
> > CPU profile is useless for these use cases at best (if a user understands
> > the difference), or misleading at worst (if a user tries to use a wrong
> > profile for a job).
> >
> > This series add latency and parallelization profiling.
> > See the added documentation and flags descriptions for details.
> >
> > Brief outline of the implementation:
> > - add context switch collection during record
> > - calculate number of threads running on CPUs (parallelism level)
> > during report
> > - divide each sample weight by the parallelism level
> > This effectively models that we were taking 1 sample per unit of
> > wall-clock time.
> >
> > We still default to the CPU profile, so it's up to users to learn
> > about the second profiling mode and use it when appropriate.
> >
> > Changes in v4:
> > - added "Shrink struct hist_entry size" commit
> > - rebased to perf-tools-next HEAD
> >
> > Changes in v3:
> > - rebase and split into patches
> > - rename 'wallclock' to 'latency' everywhere
> > - don't enable latency profiling by default,
> > instead add record/report --latency flag
> >
> > Dmitry Vyukov (8):
> > perf report: Add machine parallelism
> > perf report: Add parallelism sort key
> > perf report: Switch filtered from u8 to u16
> > perf report: Add parallelism filter
> > perf report: Add latency output field
> > perf report: Add --latency flag
> > perf report: Add latency and parallelism profiling documentation
> > perf hist: Shrink struct hist_entry size
>
> I found that stdio output is broken that latency is too high. TUI is fine.
>
> $ ./perf report --latency -s dso --stdio
> ...
> #
> # Latency Overhead Shared Object
> # ........ ........ ....................
> #
> 57669.15% 57.67% [kernel.kallsyms]
> 42330.85% 42.33% ld-linux-x86-64.so.2
Fixed in v5.
> Thanks,
> Namhyung
>
> >
> > .../callchain-overhead-calculation.txt | 5 +-
> > .../cpu-and-latency-overheads.txt | 85 ++++++++++++++++++
> > tools/perf/Documentation/perf-report.txt | 49 ++++++----
> > tools/perf/Documentation/tips.txt | 3 +
> > tools/perf/builtin-record.c | 20 +++++
> > tools/perf/builtin-report.c | 39 ++++++++
> > tools/perf/ui/browsers/hists.c | 27 +++---
> > tools/perf/ui/hist.c | 64 +++++++++----
> > tools/perf/util/addr_location.c | 1 +
> > tools/perf/util/addr_location.h | 7 +-
> > tools/perf/util/event.c | 11 +++
> > tools/perf/util/events_stats.h | 2 +
> > tools/perf/util/hist.c | 90 +++++++++++++++----
> > tools/perf/util/hist.h | 32 +++++--
> > tools/perf/util/machine.c | 7 ++
> > tools/perf/util/machine.h | 6 ++
> > tools/perf/util/sample.h | 2 +-
> > tools/perf/util/session.c | 12 +++
> > tools/perf/util/session.h | 1 +
> > tools/perf/util/sort.c | 69 ++++++++++++--
> > tools/perf/util/sort.h | 3 +-
> > tools/perf/util/symbol.c | 34 +++++++
> > tools/perf/util/symbol_conf.h | 8 +-
> > 23 files changed, 502 insertions(+), 75 deletions(-)
> > create mode 100644 tools/perf/Documentation/cpu-and-latency-overheads.txt
> >
> >
> > base-commit: 8ce0d2da14d3fb62844dd0e95982c194326b1a5f
> > --
> > 2.48.1.362.g079036d154-goog
> >
Powered by blists - more mailing lists