Message-ID: <CAJ3xEMgKfgbpxzxx595bG=bRM-ETm4vJfWALR3p-wVzzcHxHSw@mail.gmail.com>
Date:   Sun, 18 Oct 2020 20:42:28 +0300
From:   Or Gerlitz <gerlitz.or@...il.com>
To:     Andi Kleen <andi@...stfloor.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Brendan Gregg <bgregg@...flix.com>,
        Linux Netdev List <netdev@...r.kernel.org>
Subject: Re: perf measure for stalled cycles per instruction on newer Intel processors

On Thu, Oct 15, 2020 at 9:33 PM Andi Kleen <andi@...stfloor.org> wrote:
> On Thu, Oct 15, 2020 at 05:53:40PM +0300, Or Gerlitz wrote:
> > Earlier Intel processors (e.g. the E5-2650) support the two classical
> > stall events (for the backend and frontend [1]), and perf then shows
> > the nice measure of stalled cycles per instruction - e.g. here, where
> > we have an IPC of 0.91 and a CSPI (see [2]) of 0.68:
>
> Don't use it. It's misleading on an out-of-order CPU because you don't
> know if it's actually limiting anything.
>
> If you want useful bottleneck data use --topdown.
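
(For reference, the classical measure discussed above comes from the two
generic stall events; a minimal sketch of that invocation, assuming the
generic event aliases are available on the CPU in use:

$ perf stat -e cycles,instructions,\
            stalled-cycles-frontend,stalled-cycles-backend $APP

perf then derives "stalled cycles per insn" as
max(frontend stalls, backend stalls) / instructions.)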

So, running again with the params below, I got this output, where the
entire rightmost column is colored red. I wonder what can be said about
the amount/ratio of stalls for this app - perhaps you can recommend some
posts of yours to better understand that; I saw a comment in the
perf-stat man page and an LWN article but wasn't really able to figure it out.

FWIW, the kernel is 5.5.7-100.fc30.x86_64 and the CPU is an E5-2650 0.

$ perf stat  --topdown -a  taskset -c 0 $APP

[...]

 Performance counter stats for 'system wide':

                           retiring   bad speculation   frontend bound   backend bound
S0-D0-C0          1           24.9%              1.1%            16.1%           57.9%
S0-D0-C1          1           16.3%              1.3%            17.3%           65.1%
S0-D0-C2          1           17.0%              1.2%            15.3%           66.5%
S0-D0-C3          1           18.3%              0.8%             8.2%           72.8%
S0-D0-C4          1           18.1%              0.8%             8.5%           72.6%
S0-D0-C5          1           17.6%              0.8%            10.0%           71.6%
S0-D0-C6          1           18.3%              0.7%             7.4%           73.6%
S0-D0-C7          1           15.4%              1.4%            22.1%           61.2%
S1-D0-C0          1           15.9%              1.4%            16.4%           66.3%
S1-D0-C1          1           21.9%              2.6%            16.9%           58.5%
S1-D0-C2          1           20.8%              3.7%            17.1%           58.4%
S1-D0-C3          1           17.8%              1.0%             9.2%           72.1%
S1-D0-C4          1           17.8%              1.0%             9.0%           72.2%
S1-D0-C5          1           17.8%              1.0%             9.0%           72.2%
S1-D0-C6          1           17.4%              1.4%            12.8%           68.4%
S1-D0-C7          1           23.6%              4.3%            17.2%           55.0%

      13.341823591 seconds time elapsed
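
(For reference, the four columns partition issue slots per Intel's
Top-Down method. A minimal sketch of the level-1 math on a Sandy
Bridge-era part - the event names here are an assumption and vary by
model:

$ perf stat -e cycles,uops_issued.any,uops_retired.retire_slots,\
            idq_uops_not_delivered.core,int_misc.recovery_cycles $APP
# slots           = 4 * cycles   (the issue stage is 4-wide)
# frontend bound  = idq_uops_not_delivered.core / slots
# bad speculation = (uops_issued.any - uops_retired.retire_slots
#                    + 4 * int_misc.recovery_cycles) / slots
# retiring        = uops_retired.retire_slots / slots
# backend bound   = 1 - frontend bound - bad speculation - retiring

A red, i.e. high, backend-bound column means most slots go unfilled
because the backend cannot accept more uops, e.g. while waiting on memory.)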

while running with perf stat -d gives this:

$ perf stat   -d taskset -c 0 $APP

Performance counter stats for 'taskset -c 0 ./main.gcc9.3.1':

         15,075.30 msec task-clock                #    0.900 CPUs utilized
               199      context-switches          #    0.013 K/sec
                 1      cpu-migrations            #    0.000 K/sec
           117,987      page-faults               #    0.008 M/sec
    40,907,365,540      cycles                    #    2.714 GHz
    26,431,604,986      stalled-cycles-frontend   #   64.61% frontend cycles idle
    21,734,615,045      stalled-cycles-backend    #   53.13% backend cycles idle
    35,339,765,469      instructions              #    0.86  insn per cycle
                                                  #    0.75  stalled cycles per insn
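
(The derived columns follow directly from the raw counts - perf prints
"stalled cycles per insn" as max(frontend stalls, backend stalls) /
instructions, the frontend count being the larger one here:

$ echo "scale=4; 35339765469 / 40907365540" | bc   # .8639 -> 0.86 insn per cycle
$ echo "scale=4; 26431604986 / 35339765469" | bc   # .7479 -> 0.75 stalled cycles per insn
)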
