Message-ID: <CACT4Y+b96xynNaZoBHP72-4tJM5Nzo3MBRW19_E7JsMKRWYCzg@mail.gmail.com>
Date: Tue, 14 Jan 2025 09:26:57 +0100
From: Dmitry Vyukov <dvyukov@...gle.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: irogers@...gle.com, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, eranian@...gle.com
Subject: Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@...nel.org> wrote:
>
> Hello,
>
> On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. For each second of wall-clock time there are
> > number-of-cores seconds of CPU time.
> >
> > Currently perf only allows profiling CPU time. Perf (and all other
> > existing profilers, to the best of my knowledge) does not allow
> > profiling wall-clock time.
> >
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where wall-clock profile is needed:
> > - optimizing build latency
> > - optimizing server request latency
> > - optimizing ML training/inference latency
> > - optimizing running time of any command line program
> >
> > A CPU profile is at best useless for these use cases (if a user
> > understands the difference), and at worst misleading (if a user tries
> > to use the wrong profile for the job).
> >
> > This patch adds wall-clock and parallelism profiling.
> > See the added documentation and flags descriptions for details.
> >
> > Brief outline of the implementation:
> > - add context switch collection during record
> > - calculate number of threads running on CPUs (parallelism level)
> > during report
> > - divide each sample weight by the parallelism level
> > This effectively models taking one sample per unit of wall-clock
> > time.
>
> Thanks for working on this, very interesting!
>
> But I guess this implementation depends on the cpu-cycles event and a
> single target process.
Not necessarily; it works the same for other PMU events, e.g. again from profiling a kernel make:
Samples: 481K of event 'dTLB-load-misses', Event count (approx.): 169413440
  Overhead  Wallclock  Command  Symbol
     5.34%     16.33%  ld       [.] bfd_elf_link_add_symbols
     4.14%     14.99%  ld       [.] bfd_link_hash_traverse
     3.53%     10.87%  ld       [.] __memmove_avx_unaligned_erms
     1.61%      0.05%  cc1      [k] native_queued_spin_lock_slowpath
     1.47%      0.09%  cc1      [k] __d_lookup_rcu
     1.38%      0.55%  sh       [k] __percpu_counter_sum
     1.22%      0.11%  cc1      [k] __audit_filter_op
     0.95%      0.08%  cc1      [k] audit_filter_rules.isra.0
     0.72%      0.47%  sh       [k] pcpu_alloc_noprof
It also works well for multi-process targets; again, most examples
in the patch are from a kernel make.
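
To make the mechanics concrete, here is a simplified, self-contained
sketch of what the report-side bookkeeping amounts to (not the actual
patch code; the event stream, names and weights are made up):
switch-in/out events maintain a count of running workload threads,
and each sample is weighted by 1/parallelism for the wall-clock view.

/*
 * Sketch only, not the patch code: derive CPU and wall-clock weights
 * from a toy stream of switch and sample events.
 */
#include <stdio.h>

enum ev_type { EV_SWITCH_IN, EV_SWITCH_OUT, EV_SAMPLE };

struct event { enum ev_type type; };

int main(void)
{
	/* Toy stream: 2 threads run, 2 samples, 1 thread leaves, 2 samples. */
	struct event stream[] = {
		{EV_SWITCH_IN}, {EV_SWITCH_IN},
		{EV_SAMPLE}, {EV_SAMPLE},
		{EV_SWITCH_OUT},
		{EV_SAMPLE}, {EV_SAMPLE},
	};
	int parallelism = 0;
	double cpu = 0, wallclock = 0;

	for (unsigned int i = 0; i < sizeof(stream) / sizeof(stream[0]); i++) {
		switch (stream[i].type) {
		case EV_SWITCH_IN:
			parallelism++;
			break;
		case EV_SWITCH_OUT:
			parallelism--;
			break;
		case EV_SAMPLE:
			/*
			 * CPU profile: every sample counts the same.
			 * Wall-clock profile: a sample taken while N threads
			 * run represents only 1/N of a unit of wall-clock time.
			 */
			cpu += 1.0;
			wallclock += 1.0 / parallelism;
			break;
		}
	}
	/* Prints: cpu weight 4.00, wall-clock weight 3.00 */
	printf("cpu weight %.2f, wall-clock weight %.2f\n", cpu, wallclock);
	return 0;
}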
> Do you think if it'd work for system-wide profiling?
Currently it doesn't, because system-wide mode emits
PERF_RECORD_SWITCH_CPU_WIDE instead of PERF_RECORD_SWITCH, which is
what this patch handles. It can be made to work, I just didn't want to
complicate this patch further.
It can work as is if the system activity is dominated by a single
task: e.g. you run something on the command line, it's taking too
long, so you do perf record -a in parallel to see what it's doing.
For the system-wide profiling we do at Google (GWP), it may need some
additional logic to make it useful. Namely, system-wide parallelism is
not useful in multi-task contexts. But we could extend report to allow
calculating parallelism/wall-clock overhead for a single multi-threaded
process only. Or, if we automatically post-process a system-wide
profile, we could split the profile per process and calculate
parallelism for each of them.
I am also not sure whether we use perf report for GWP or do manual
trace parsing. When/if this part is merged, we can think about the
system-wide story.
TODO: don't enable context-switch recording for -a mode.
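
If/when SWITCH_CPU_WIDE is handled, one possible shape of the
bookkeeping is roughly the following (illustration only, not part of
the patch; NR_CPUS and the event layout are made up, and a real
version would group threads by tgid rather than a bare pid):

/*
 * Hypothetical sketch: remember which pid currently owns each CPU and
 * count CPUs per pid to get per-process parallelism.
 */
#include <stdio.h>

#define NR_CPUS 4

struct cpu_switch {
	int cpu;
	int pid; /* pid switching in on this CPU, -1 for idle */
};

static int parallelism(const int *cpu_owner, int pid)
{
	int n = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_owner[cpu] == pid)
			n++;
	return n;
}

int main(void)
{
	int cpu_owner[NR_CPUS] = { -1, -1, -1, -1 };
	/* Toy stream of per-CPU switch-in events. */
	struct cpu_switch stream[] = {
		{0, 100}, {1, 100}, {2, 200}, {3, 100}, {1, -1},
	};

	for (unsigned int i = 0; i < sizeof(stream) / sizeof(stream[0]); i++) {
		cpu_owner[stream[i].cpu] = stream[i].pid;
		printf("after event %u: parallelism(pid 100) = %d\n",
		       i, parallelism(cpu_owner, 100));
	}
	return 0;
}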
> How do you define wall-clock overhead if the event counts something
> different (like the number of L3 cache-misses)?
It's defined the same way. We already calculate CPU overhead for such
events; wall-clock overhead is the same, just with the weights adjusted
proportionally to the parallelism.
The physical meaning is similar:
CPU overhead: how much L3 misses affect throughput/CPU consumption
Wall-clock overhead: how much L3 misses affect latency
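As a concrete, made-up example: an L3-miss sample taken while 4 threads
of the workload are running contributes 1 to the Overhead column but
only 1/4 to the Wallclock column; a sample taken while a single thread
runs contributes 1 to both. The event itself is unchanged, only the
weighting differs.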
> Also I'm not sure about the impact of context switch events which could
> generate a lot of records that may end up with losing some of them. And
> in that case the parallelism tracking would break.
I've counted SAMPLE/SWITCH events for a profile of a kernel make:
perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
813829
perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
46249
And this is for:
perf record -e L1-dcache-load-misses,dTLB-load-misses make
perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
1138615
perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
115433
This doesn't pretend to be scientifically accurate, but it gives some
idea of the proportions.
Relative to the number of samples, that's roughly 5% and 10% respectively.
SWITCH events are also smaller (24 bytes vs 40 bytes for SAMPLE), so
relative to ring-buffer size it's 3-6%.
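Spelling out the byte math with the numbers above: 46249*24 / (813829*40)
is ~3.4% and 115433*24 / (1138615*40) is ~6.1%, hence the 3-6%.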
FWIW I've also considered and started implementing a different
approach where the kernel would count the parallelism level for each
context and write it out with samples:
https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
There, sched_in/out would need to do an atomic inc/dec on a global
counter. I'm not sure how hard it is to make all the corner cases work;
I dropped it halfway because post-processing the recorded data in perf
report looked like a better approach.
Lost SWITCH events may be a problem. Perf report already detects and
prints a warning about lost events. Should we stop reporting
parallelism/wall-clock if any events are lost? I just don't see lost
events in practice.
The kernel could also write out a bitmask of the types of lost events
in PERF_RECORD_LOST.
> > The feature is added on an equal footing with the existing CPU profiling
> > rather than as a separate mode enabled with special flags. The reasoning is
> > that users may not understand the problem and the meaning of numbers they
> > are seeing in the first place, so won't even realize that they may need
> > to be looking for some different profiling mode. When they are presented
> > with 2 sets of different numbers, they should start asking questions.
>
> I understand your point, but I think it has some limitations, so maybe
> it's better to put it in a separate mode with special flags.
I hate all options.
The 3rd option is to add a _mandatory_ flag to perf record to ask for
the profiling mode. Then the old "perf record" would say "nope, you need
to choose, here are your options and an explanation of them".
If we add it as a new separate mode, I fear it's doomed to be
unknown/unused, and people will continue to not realize what they are
profiling or what the numbers mean, and will continue to use the wrong
profile for the job.
I consider latency/wall-clock profiling to be as useful and as
fundamental as the current throughput profiling for parallel workloads
(which is more or less all workloads today). Yet here we are in 2025,
and no profilers out there support latency profiling. That suggests to
me a fundamental lack of understanding of the matter, so a new optional
mode may not work well.
For context, if you are interested: I used a similar mode (in pprof at
the time) to optimize TensorFlow latency (people mostly care about
latency there). All existing profilers said it was 99% highly optimized
and parallelized matrix multiplication, we are done here. Yet the
wall-clock profile said, nope, it's 45% Python interpreter, which
resulted in this fix:
https://github.com/soumith/convnet-benchmarks/commit/605988f847fcf261c288abc2351597d71a2ee149
Full story and examples of these profiles are here:
https://github.com/dvyukov/perf-load