Message-ID: <Z4XDJyvjiie3howF@google.com>
Date: Mon, 13 Jan 2025 17:51:35 -0800
From: Namhyung Kim <namhyung@...nel.org>
To: Dmitry Vyukov <dvyukov@...gle.com>
Cc: irogers@...gle.com, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, eranian@...gle.com
Subject: Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
Hello,
On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> There are two notions of time: wall-clock time and CPU time.
> For a single-threaded program, or a program running on a single-core
> machine, these notions are the same. However, for a multi-threaded/
> multi-process program running on a multi-core machine, these notions are
> significantly different. For each second of wall-clock time we have up
> to number-of-cores seconds of CPU time.
>
> Currently perf can only profile CPU time. Perf (and, to the best of my
> knowledge, all other existing profilers) cannot profile wall-clock
> time.
>
> Optimizing CPU overhead is useful to improve 'throughput', while
> optimizing wall-clock overhead is useful to improve 'latency'.
> These profiles are complementary and are not interchangeable.
> Examples of where wall-clock profile is needed:
> - optimizing build latency
> - optimizing server request latency
> - optimizing ML training/inference latency
> - optimizing running time of any command line program
>
> CPU profile is useless for these use cases at best (if a user understands
> the difference), or misleading at worst (if a user tries to use a wrong
> profile for a job).
>
> This patch adds wall-clock and parallelization profiling.
> See the added documentation and flags descriptions for details.
>
> Brief outline of the implementation:
> - add context switch collection during record
> - calculate number of threads running on CPUs (parallelism level)
> during report
> - divide each sample weight by the parallelism level
> This effectively models that we were taking 1 sample per unit of
> wall-clock time.
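
The three steps in the outline above can be sketched roughly as follows.
This is only an illustration of the idea, not the code from the patch:
the Event record, its field names and the wall_clock_weights() helper
are all hypothetical, and real perf data carries far more state.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical, simplified record; not perf's actual data layout.
    time: int            # timestamp of the event
    kind: str            # "switch_in", "switch_out" or "sample"
    tid: int             # thread the event refers to
    weight: float = 0.0  # sample weight (only meaningful for samples)


def wall_clock_weights(events):
    """Return (tid, cpu_weight, wall_weight) for every sample.

    The parallelism level is the number of threads currently on a CPU,
    derived from context switch events. Each sample's weight is divided
    by that level to approximate one sample per unit of wall-clock time.
    """
    running = set()
    result = []
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "switch_in":
            running.add(ev.tid)
        elif ev.kind == "switch_out":
            running.discard(ev.tid)
        else:
            # A sampled thread is on a CPU by definition, so count it
            # even if its switch-in event was never seen.
            parallelism = len(running | {ev.tid})
            result.append((ev.tid, ev.weight, ev.weight / parallelism))
    return result


events = [
    Event(0, "switch_in", 1),
    Event(1, "sample", 1, 1.0),   # only thread 1 runs
    Event(2, "switch_in", 2),
    Event(3, "sample", 1, 1.0),   # threads 1 and 2 run
    Event(3, "sample", 2, 1.0),
    Event(4, "switch_out", 2),
    Event(5, "sample", 1, 1.0),   # only thread 1 runs again
]
weights = wall_clock_weights(events)
```

In this toy trace the CPU weights sum to 4 while the wall-clock weights
sum to 3: the two samples taken while both threads ran count for half
each.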
Thanks for working on this, very interesting!
But I guess this implementation depends on the cpu-cycles event and a
single target process. Do you think it would work for system-wide
profiling?
How do you define wall-clock overhead if the event counts something
different (like the number of L3 cache-misses)?
Also, I'm not sure about the impact of context switch events: they could
generate so many records that some of them end up being lost, and in that
case the parallelism tracking would break.
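
To make the concern concrete, here is a toy sketch (hypothetical event
tuples and helper name, not perf's record format) of how a single lost
switch-out record can skew the parallelism level for later samples:

```python
def parallelism_at_samples(events):
    """Return the parallelism level observed at each sample.

    `events` is a list of (kind, tid) tuples in time order, where kind
    is "in", "out" or "sample". This is an assumed toy format.
    """
    running = set()
    levels = []
    for kind, tid in events:
        if kind == "in":
            running.add(tid)
        elif kind == "out":
            running.discard(tid)
        else:
            # The sampled thread itself is always counted as running.
            levels.append(len(running | {tid}))
    return levels


full = [("in", 1), ("in", 2), ("sample", 1), ("out", 2), ("sample", 1)]
# Same trace, but the switch-out record for thread 2 was dropped.
lossy = [e for e in full if e != ("out", 2)]

full_levels = parallelism_at_samples(full)
lossy_levels = parallelism_at_samples(lossy)
```

With the complete trace the second sample sees a parallelism level of 1;
with the lost record thread 2 is still counted as running, so the level
stays at 2 and the sample's wall-clock weight would be halved.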
>
> The feature is added on an equal footing with the existing CPU profiling
> rather than a separate mode enabled with special flags. The reasoning is
> that users may not understand the problem and the meaning of numbers they
> are seeing in the first place, so won't even realize that they may need
> to be looking for some different profiling mode. When they are presented
> with 2 sets of different numbers, they should start asking questions.
I understand your point, but I think it has some limitations, so maybe
it's better to put it in a separate mode with special flags.
Thanks,
Namhyung