Message-ID: <CACT4Y+bA=NdTBjJRCGPr-HPe0=qBKfz5=afr3d+HJ_xLZZ_mKA@mail.gmail.com>
Date: Fri, 24 Jan 2025 11:46:01 +0100
From: Dmitry Vyukov <dvyukov@...gle.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: irogers@...gle.com, linux-perf-users@...r.kernel.org, 
	linux-kernel@...r.kernel.org, eranian@...gle.com
Subject: Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling

What are the next steps for this patch?

I don't have any action items on my side. Am I missing any? Ready to be merged?



On Sun, 19 Jan 2025 at 11:50, Dmitry Vyukov <dvyukov@...gle.com> wrote:
>
> On Wed, 15 Jan 2025 at 01:30, Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > On Tue, Jan 14, 2025 at 09:26:57AM +0100, Dmitry Vyukov wrote:
> > > On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@...nel.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> > > > > There are two notions of time: wall-clock time and CPU time.
> > > > > For a single-threaded program, or a program running on a single-core
> > > > > machine, these notions are the same. However, for a multi-threaded/
> > > > > multi-process program running on a multi-core machine, these notions are
> > > > > significantly different. For each second of wall-clock time we have
> > > > > number-of-cores seconds of CPU time.
> > > > >
> > > > > Currently perf only allows profiling CPU time. Perf (and all other
> > > > > existing profilers, to the best of my knowledge) does not allow profiling
> > > > > wall-clock time.
> > > > >
> > > > > Optimizing CPU overhead is useful to improve 'throughput', while
> > > > > optimizing wall-clock overhead is useful to improve 'latency'.
> > > > > These profiles are complementary and are not interchangeable.
> > > > > Examples of where wall-clock profile is needed:
> > > > >  - optimizing build latency
> > > > >  - optimizing server request latency
> > > > >  - optimizing ML training/inference latency
> > > > >  - optimizing running time of any command line program
> > > > >
> > > > > CPU profile is useless for these use cases at best (if a user understands
> > > > > the difference), or misleading at worst (if a user tries to use the wrong
> > > > > profile for the job).
> > > > >
> > > > > This patch adds wall-clock and parallelization profiling.
> > > > > See the added documentation and flags descriptions for details.
> > > > >
> > > > > Brief outline of the implementation:
> > > > >  - add context switch collection during record
> > > > >  - calculate number of threads running on CPUs (parallelism level)
> > > > >    during report
> > > > >  - divide each sample weight by the parallelism level
> > > > > This effectively models that we were taking 1 sample per unit of
> > > > > wall-clock time.
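
To make the outline above concrete, here is a minimal, hypothetical
post-processing sketch in Python (not the actual perf C code; the event
format and field names are invented for illustration). It tracks the
parallelism level from switch events and weights each sample by
1/parallelism:

    from collections import defaultdict

    def wallclock_profile(events):
        # events: an iterable of dicts such as
        #   {"type": "SWITCH_IN" | "SWITCH_OUT" | "SAMPLE", "symbol": "..."},
        # yielded in timestamp order for one profiled workload.
        parallelism = 0
        cpu = defaultdict(float)        # classic CPU overhead per symbol
        wallclock = defaultdict(float)  # sample weight divided by parallelism

        for ev in events:
            if ev["type"] == "SWITCH_IN":
                parallelism += 1
            elif ev["type"] == "SWITCH_OUT":
                parallelism = max(parallelism - 1, 0)
            else:  # SAMPLE
                p = max(parallelism, 1)  # a sample implies >= 1 running thread
                cpu[ev["symbol"]] += 1.0
                wallclock[ev["symbol"]] += 1.0 / p

        def percent(d):
            total = sum(d.values()) or 1.0
            return {sym: 100.0 * w / total for sym, w in d.items()}

        return percent(cpu), percent(wallclock)

In the patch itself this bookkeeping happens during perf report on the
recorded PERF_RECORD_SWITCH stream; the sketch only illustrates the
weighting.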
> > > >
> > > > Thanks for working on this, very interesting!
> > > >
> > > > But I guess this implementation depends on cpu-cycles event and single
> > > > target process.
> > >
> > > Not necessarily, it works the same for PMU events, e.g. again from perf make:
> >
> > Of course it would work with other events.  But the thing is how to
> > interpret the result IMHO.
> >
> > >
> > > Samples: 481K of event 'dTLB-load-misses', Event count (approx.): 169413440
> > > Overhead  Wallclock  Command          Symbol
> > >    5.34%     16.33%  ld               [.] bfd_elf_link_add_symbols
> > >    4.14%     14.99%  ld               [.] bfd_link_hash_traverse
> > >    3.53%     10.87%  ld               [.] __memmove_avx_unaligned_erms
> > >    1.61%      0.05%  cc1              [k] native_queued_spin_lock_slowpath
> > >    1.47%      0.09%  cc1              [k] __d_lookup_rcu
> > >    1.38%      0.55%  sh               [k] __percpu_counter_sum
> > >    1.22%      0.11%  cc1              [k] __audit_filter_op
> > >    0.95%      0.08%  cc1              [k] audit_filter_rules.isra.0
> > >    0.72%      0.47%  sh               [k] pcpu_alloc_noprof
> > >
> > > And also it works well for multi-process targets, again most examples
> > > in the patch are from kernel make.
> >
> > It's the single-origin case.  I'm curious about the multiple-origin
> > cases like profiling two or more existing processes together.  How do
> > you determine the initial parallelism?
>
> What is the "multiple-origin case"? How is it different from multiple processes?
>
> > Also I think the parallelism is defined in a process (group).  Does it
> > still make sense in system-wide profiling?
>
> I've tried to answer this here:
>
> > It can work as is, if the system activity includes mostly a single
> > task, e.g. you run something on command line, it's taking too long, so
> > you decide to do perf record -a in parallel to see what it's doing.
> >
> > For the system-wide profiling we do at Google (GWP), it may need some
> > additional logic to make it useful. Namely, the system-wide
> > parallelism is not useful in multi-task contexts. But we can extend
> > report to allow us to calculate parallelism/wall-clock overhead for a
> > single multi-threaded process only. Or, if we automatically
> > post-process the system-wide profile, then we can split the profile
> > per-process and calculate parallelism for each of them.
>
> Additionally we could add a process filter that calculates
> parallelism only over a given set of processes (specified by PIDs or
> comm names). I.e. one profiles the whole system, but then says "I am
> really interested only in processes X and Y".
>
>
>
> > > > Do you think it'd work for system-wide profiling?
> > >
> > > Currently it doesn't b/c system-wide emits PERF_RECORD_SWITCH_CPU_WIDE
> > > instead of PERF_RECORD_SWITCH which this patch handles.
> > >
> > > It can be made to work, I just didn't want to complicate this patch more.
> >
> > Ok.
> >
> > > It can work as is, if the system activity includes mostly a single
> > > task, e.g. you run something on command line, it's taking too long, so
> > > you decide to do perf record -a in parallel to see what it's doing.
> > >
> > > For the system-wide profiling we do at Google (GWP), it may need some
> > > additional logic to make it useful. Namely, the system-wide
> > > parallelism is not useful in multi-task contexts. But we can extend
> > > report to allow us to calculate parallelism/wall-clock overhead for a
> > > single multi-threaded process only. Or, if we automatically
> > > post-process the system-wide profile, then we can split the profile
> > > per-process and calculate parallelism for each of them.
> > >
> > > I am also not sure if we use perf report for GWP, or do manual trace
> > > parsing. When/if this part is merged, we can think of the system-wide
> > > story.
> > >
> > > TODO: don't enable context-switch recording for -a mode.
> > >
> > >
> > > > How do you define wall-clock overhead if the event counts something
> > > > different (like the number of L3 cache-misses)?
> > >
> > > It's the same. We calculate CPU overhead for these. Wallclock overhead
> > > is the same, it's just that the weights are adjusted proportionally.
> >
> > I got this part but ...
> >
> > >
> > > Physical meaning is similar:
> > > CPU overhead: how much L3 misses affect throughput/cpu consumption
> > > Wallclock overhead: how much L3 misses affect latency
> >
> > I'm not sure if it's correct.  The event would tell where you get the L3
> > misses mostly.  It doesn't necessarily mean latency.  IOW more L3 misses
> > don't mean more CPU time.  I guess that's why perf doesn't call it
> > "CPU overhead".
>
> Yes, I agree, it does not directly translate to increased CPU
> consumption or latency.
> But the point is: whatever the effect on CPU consumption was, it's
> rescaled to latency.
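
As a toy illustration of that rescaling (numbers invented here, not taken
from the patch): suppose a report has 100 L3-miss samples in function A,
all taken at parallelism 1, and 300 samples in function B, all taken at
parallelism 4. The CPU overhead is 25% for A and 75% for B, but the
wall-clock weights are 100/1 = 100 and 300/4 = 75, so the wall-clock
overhead is about 57% for A and 43% for B, i.e. A dominates the latency
even though B dominates CPU consumption.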
>
>
> > Hmm.. maybe you could rename it differently?
>
> You mean the "perf report" help page, right?
>
> Yes, strictly speaking it's not "CPU overhead" and "Wall-clock
> overhead", but rather "Overhead in CPU time" and "Overhead in
> wall-clock time". Does it look better to you?
>
>
> > > > Also I'm not sure about the impact of context switch events, which could
> > > > generate a lot of records and may end up losing some of them.  And
> > > > in that case the parallelism tracking would break.
> > >
> > > I've counted SAMPLE/SWITCH events on perf make:
> > >
> > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> > > 813829
> > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> > > 46249
> > >
> > > And this is for:
> > > perf record -e L1-dcache-load-misses,dTLB-load-misses make
> > >
> > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> > > 1138615
> > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> > > 115433
> > >
> > > Doesn't pretend to be scientifically accurate, but gives some idea
> > > about proportion.
> > > So for the number of samples that's 5% and 10% respectively.
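
(That is, 46249 / 813829 ≈ 5.7% and 115433 / 1138615 ≈ 10.1% additional
events relative to the number of samples.)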
> >
> > I think it depends on the workload.  In a very loaded system with a short
> > timeslice or something, you could get a lot more sched switches.  In
> > your workload, `make` may have a long single-threaded portion.
>
> Unfortunately this is likely to be true.
>
> > > SWITCH events are smaller (24 bytes vs 40 bytes for SAMPLE), so for
> > > ring size it's 3-6%.
> > >
> > > FWIW I've also considered and started implementing a different
> > > approach where the kernel would count parallelism level for each
> > > context and write it out with samples:
> > > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > > Then sched_in/out would need to do atomic inc/dec on a global counter.
> >
> > It seems you counted parallelism in an event (including inherited ones).
> > But usually the perf tool opens a separate FD on each CPU for an event (like
> > cycles), so then you need to adjust the value across all CPUs.  I'm not
> > sure if it's doable in this approach.
>
> Right, this is probably when I gave up on that approach -- when I
> observed the counter had a value of 1 at most.
>
> > > Not sure how hard it is to make all corner cases work there; I dropped
> > > it halfway b/c the perf record post-processing looked like a better
> > > approach.
> >
> > Probably.
> >
> > >
> > > Lost SWITCH events may be a problem. Perf report already detects and
> > > prints a warning about lost events. Should we stop reporting
> > > parallelism/wall-clock if any events are lost? I just don't see it in
> > > practice.
> >
> > It happens quite often when there are many activities.  Probably we can
> > consider stopping when it detects some lost records.
> >
> >
> > > The kernel could write out a bitmask of the types of lost events in PERF_RECORD_LOST.
> >
> > It'd be nice, but given the rate of sched switches I guess it'd most
> > likely contain some.
>
> What's an easy way to get lost events? Normally on my machine I never see them.
> Is there some parameter to make the buffer very small, or something?
>
>
> > > > > The feature is added on an equal footing with the existing CPU profiling
> > > > > rather than as a separate mode enabled with special flags. The reasoning is
> > > > > that users may not understand the problem and the meaning of the numbers they
> > > > > are seeing in the first place, so they won't even realize that they may need
> > > > > to be looking for some different profiling mode. When they are presented
> > > > > with 2 sets of different numbers, they should start asking questions.
> > > >
> > > > I understand your point, but I think it has some limitations, so maybe it's
> > > > better to put it in a separate mode with special flags.
> > >
> > > I hate all options.
> > >
> > > The 3rd option is to add a _mandatory_ flag to perf record to ask for
> > > the profiling mode. Then the old "perf record" will say "nope, you need to
> > > choose, here are your options and an explanation of the options".
> > >
> > > If we add it as a new separate mode, I fear it's doomed to be
> > > unknown/unused and people will continue to not realize what they are
> > > profiling, or the meaning of the numbers, and continue to use the wrong
> > > profiles for the job.
> >
> > I think people will find it out if it really works well! :)
>
> Yes, but it may take 10 years for the word to spread :)
> I am afraid the main problem is that people don't know they need to be
> looking for something. We've had 20 years of widespread multi-CPU/core
> systems and no profiler supporting latency profiling.
>
> But as I said, I am open to options (at least I will know it's there
> for my purposes :)).
>
>
> > > I consider latency/wall-clock profiling as useful and as fundamental
> > > as the current bandwidth profiling for parallel workloads (more or
> > > less all workloads today). Yet we are in 2025 and no profilers out
> > > there support latency profiling. Which suggests to me there is a
> > > fundamental lack of understanding of the matter, so a new optional mode
> > > may not work well.
> >
> > I'm open to making it the default if we can make sure it works well and has a
> > clear interpretation of the result.
> >
> > >
> > > If you are interested, for context: I used a similar mode (for pprof at
> > > the time) to optimize TensorFlow latency (people mostly care about
> > > latency). All existing profilers said it's 99% highly optimized and
> > > parallelized matrix multiplication, we are done here. Yet, the
> > > wallclock profile said, nope, it's 45% Python interpreter, which
> > > resulted in this fix:
> > > https://github.com/soumith/convnet-benchmarks/commit/605988f847fcf261c288abc2351597d71a2ee149
> > > Full story and examples of these profiles are here:
> > > https://github.com/dvyukov/perf-load
> >
> > Great examples!  I'll take a look later.
> >
> > Thanks,
> > Namhyung
> >
