linux-kernel - Re: [PATCH] tools/perf: Add wall-clock and parallelism profiling


Message-ID: <CACT4Y+an1LSY15f9MS_vnbaaeeqMf+k4-Dqqfu-zwcUAHFNk=w@mail.gmail.com>
Date: Wed, 8 Jan 2025 09:34:28 +0100
From: Dmitry Vyukov <dvyukov@...gle.com>
To: namhyung@...nel.org, irogers@...gle.com
Cc: linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] tools/perf: Add wall-clock and parallelism profiling

On Wed, 8 Jan 2025 at 09:24, Dmitry Vyukov <dvyukov@...gle.com> wrote:
>
> There are two notions of time: wall-clock time and CPU time.
> For a single-threaded program, or a program running on a single-core
> machine, these notions are the same. However, for a multi-threaded/
> multi-process program running on a multi-core machine, these notions are
> significantly different. For each second of wall-clock time, we have up to
> number-of-cores seconds of CPU time.
>
> Currently perf only allows profiling CPU time. Perf (and all other
> existing profilers, to the best of my knowledge) does not allow profiling
> wall-clock time.
>
> Optimizing CPU overhead is useful to improve 'throughput', while
> optimizing wall-clock overhead is useful to improve 'latency'.
> These profiles are complementary and are not interchangeable.
> Examples of where a wall-clock profile is needed:
>  - optimizing build latency
>  - optimizing server request latency
>  - optimizing ML training/inference latency
>  - optimizing running time of any command line program
>
> A CPU profile is useless for these use cases at best (if the user
> understands the difference), or misleading at worst (if the user tries to
> use the wrong profile for the job).
>
> This patch adds wall-clock and parallelization profiling.
> See the added documentation and flags descriptions for details.
>
> Brief outline of the implementation:
>  - add context switch collection during record
>  - calculate number of threads running on CPUs (parallelism level)
>    during report
>  - divide each sample weight by the parallelism level
> This effectively models taking 1 sample per unit of wall-clock time.
>
> The feature is added on an equal footing with the existing CPU profiling
> rather than a separate mode enabled with special flags. The reasoning is
> that users may not understand the problem and the meaning of numbers they
> are seeing in the first place, so won't even realize that they may need
> to be looking for some different profiling mode. When they are presented
> with 2 sets of different numbers, they should start asking questions.
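
The implementation outline quoted above (track context switches, derive the parallelism level, divide each sample weight by it) can be illustrated with a small sketch. This is not the patch's code: the event tuples, the function names and the toy trace are all assumptions made up for illustration, and real perf data has per-CPU timestamps and more event types.

```python
from collections import defaultdict

def weigh_samples(events):
    """Accumulate CPU and wall-clock weights per symbol.

    `events` is a hypothetical, time-ordered list of tuples:
      ("in", tid)                      thread switched onto a CPU
      ("out", tid)                     thread switched off a CPU
      ("sample", tid, symbol, weight)  profiling sample
    """
    running = set()             # threads currently on a CPU
    cpu = defaultdict(float)    # plain sample weights (CPU time)
    wall = defaultdict(float)   # weights divided by parallelism
    for ev in events:
        kind = ev[0]
        if kind == "in":
            running.add(ev[1])
        elif kind == "out":
            running.discard(ev[1])
        else:
            _, tid, symbol, weight = ev
            running.add(tid)    # a sampled thread is running
            parallelism = len(running)
            cpu[symbol] += weight
            wall[symbol] += weight / parallelism
    return dict(cpu), dict(wall)

def to_percent(weights):
    """Normalize accumulated weights to percentages."""
    total = sum(weights.values())
    return {k: 100.0 * v / total for k, v in weights.items()}

# Toy trace: a serial "link" phase on one thread, then a "compile"
# phase running on four threads at once.
trace = [("in", 1),
         ("sample", 1, "link", 1.0),
         ("sample", 1, "link", 1.0),
         ("in", 2), ("in", 3), ("in", 4)]
for _ in range(2):
    for tid in (1, 2, 3, 4):
        trace.append(("sample", tid, "compile", 1.0))

cpu, wall = weigh_samples(trace)
print("CPU %: ", to_percent(cpu))
print("Wall %:", to_percent(wall))
```

In this toy trace the serial phase holds 2 of the 10 samples, so it looks minor in the CPU profile, yet it accounts for half of the wall-clock weight once the parallel samples are divided by 4.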

Hi folks,

Am I missing something, and is this already possible/known?

I understand this is a large change, and I am open to comments.
I've also uploaded it to gerrit if you prefer to review there:
https://linux-review.git.corp.google.com/c/linux/kernel/git/torvalds/linux/+/25608

You may also check out that branch and try it locally. It works on older kernels.

Which parts of this are testable within the current testing framework?
Also, how do I run the tests? I failed to figure it out.

Btw, the profile example in the docs is from a real kernel build on my machine.
You can see how misleading the current profile is wrt latency.

Or you can see what takes time in the perf make itself
(despite -j128, 73% of the time was spent with 1 running thread, and
only a few percent of the time was spent with high parallelism).

  Wallclock  Overhead           Parallelism / Command
-    73.64%     6.96%           1
   +    28.53%     2.70%           cc1
   +    17.93%     1.69%           python3
   +    10.79%     1.02%           ld
-     7.49%     1.42%           2
   +     4.26%     0.81%           cc1
   +     0.72%     0.14%           ld
   +     0.68%     0.13%           cc1plus
...
-     1.33%    15.74%           125
   +     1.23%    14.50%           cc1
   +     0.03%     0.33%           gcc
   +     0.03%     0.32%           sh
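
As a rough sanity check of how the two columns relate, the top-level rows above can be plugged into the "divide by parallelism" rule. This is only a sketch using the three parallelism rows visible in the output (the elided rows are not reconstructed): if Wallclock is Overhead divided by the parallelism level and then renormalized, the ratio below should come out close to the same constant for every row.

```python
# (parallelism, wallclock %, overhead %) copied from the rows shown above.
rows = [
    (1, 73.64, 6.96),
    (2, 7.49, 1.42),
    (125, 1.33, 15.74),
]

def normalization_ratio(parallelism, wallclock, overhead):
    """Wallclock share divided by the parallelism-scaled overhead share."""
    return wallclock / (overhead / parallelism)

ratios = [normalization_ratio(*row) for row in rows]
for (parallelism, _, _), ratio in zip(rows, ratios):
    print(f"parallelism {parallelism:>3}: ratio {ratio:.2f}")
```

The ratios agree to within rounding of the printed percentages, which matches the description of wall-clock weight as CPU weight divided by the parallelism level.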