linux-kernel - Re: [PATCH] perf events: Add stalled cycles generic event

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110427111141.GB28993@elte.hu>
Date:	Wed, 27 Apr 2011 13:11:41 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Arun Sharma <asharma@...com>
Cc:	arun@...rma-home.net, Stephane Eranian <eranian@...gle.com>,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	linux-kernel@...r.kernel.org, Andi Kleen <ak@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Lin Ming <ming.m.lin@...el.com>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>, eranian@...il.com,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] perf events: Add stalled cycles generic event -
 PERF_COUNT_HW_STALLED_CYCLES


* Arun Sharma <asharma@...com> wrote:

>     cycles = c["cycles"] * 1.0
>     cycles_no_uops_issued = cycles - c["UOPS_ISSUED:ANY:c=1:t=1"]
>     cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
>     backend_stall_cycles = c["RESOURCE_STALLS:ANY"]
>     icache_stall_cycles = c["L1I:CYCLES_STALLED"]
> 
>     # Cycle stall accounting
>     print "Percent idle: %d%%" % ((1 - cycles/(int(options.time) * cycles_per_sec)) * 100)
>     print "\tRetirement Stalls: %d%%" % ((cycles_no_uops_retired / cycles) * 100)
>     print "\tBackend Stalls: %d%%" % ((backend_stall_cycles / cycles) * 100)
>     print "\tFrontend Stalls: %d%%" % ((cycles_no_uops_issued / cycles) * 100)
>     print "\tInstruction Starvation: %d%%" % (((cycles_no_uops_issued - backend_stall_cycles)/cycles) * 100)
>     print "\ticache stalls: %d%%" % ((icache_stall_cycles/cycles) * 100)
> 
>     # Wasted work
>     uops_executed = c["UOPS_EXECUTED:PORT015:t=1"] + c["UOPS_EXECUTED:PORT234_CORE"]
>     uops_retired = c["UOPS_RETIRED:ANY:t=1"]
>     uops_issued = c["UOPS_ISSUED:ANY:t=1"] + c["UOPS_ISSUED:FUSED:t=1"]
> 
>     print "\tPercentage useless uops: %d%%" % ((uops_executed - uops_retired) * 100.0/uops_retired)
>     print "\tPercentage useless uops issued: %d%%" % ((uops_issued - uops_retired) * 100.0/uops_retired)

Just an update: i started working on generalizing these events.

As a first step i'd like to introduce stall statistics in default 'perf stat' 
output, then as a second step offer more detailed modi of analysis (like your 
script).

As for the first, 'overview' step, i'd like to use one or two numbers only, to 
give people a general ballpark figure about how good the CPU is performing for 
a given workload.

Wouldnt UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good, 
primary "stall" indicator? This is similar to the "cycles-uops_executed" value 
in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE 
based): it counts cycles when there's no execution at all - not even 
speculative one.

This would cover a wide variety of 'stall' reasons: external latency, or 
stalling on lack of paralellism in the incoming instruction stream and most 
other stall reasons. So it would measure everything that moves the CPU away 
from 100% utilization.

Secondly, the 'speculative waste' proportion is probably pretty well captured 
via branch-misprediction counts - those are the primary source of filling the 
pipeline with useless work.

So in the most high-level view we could already print useful information via 
the introduction of a single new generic event:

   PERF_COUNT_HW_CPU_CYCLES_BUSY

and 'idle cycles' are "cycles-busy_cycles".

I have implemented preliminary support for this, and this is how the new 
'overview' output looks like currently. Firstly here's the output from a "bad" 
testcase (lots of branch-misses):

 $ perf stat ./branches 20 1

 Performance counter stats for './branches 20 1' (10 runs):

          9.829903 task-clock               #    0.972 CPUs utilized            ( +-  0.07% )
                 0 context-switches         #    0.000 M/sec                    ( +-  0.00% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-  0.00% )
               111 page-faults              #    0.011 M/sec                    ( +-  0.09% )
        31,470,886 cycles                   #    3.202 GHz                      ( +-  0.06% )
         9,825,068 stalled-cycles           #   31.22% of all cycles are idle   ( +- 13.89% )
        27,868,090 instructions             #    0.89  insns per cycle        
                                            #    0.35  stalled cycles per insn  ( +-  0.02% )
         4,313,661 branches                 #  438.830 M/sec                    ( +-  0.02% )
         1,068,668 branch-misses            #   24.77% of all branches          ( +-  0.01% )

        0.010117520  seconds time elapsed  ( +-  0.14% )

The two important values are the "20.00% of all cycles are idle" and the 
"24.77% of all branches" missed values - both are high and indicative of 
trouble.

The fixed testcase shows:

 Performance counter stats for './branches 20 0' (100 runs):

          4.417987 task-clock               #    0.948 CPUs utilized            ( +-  0.10% )
                 0 context-switches         #    0.000 M/sec                    ( +-  0.00% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-  0.00% )
               111 page-faults              #    0.025 M/sec                    ( +-  0.02% )
        14,135,368 cycles                   #    3.200 GHz                      ( +-  0.10% )
         1,939,275 stalled-cycles           #   13.72% of all cycles are idle   ( +-  4.99% )
        27,846,610 instructions             #    1.97  insns per cycle        
                                            #    0.07  stalled cycles per insn  ( +-  0.00% )
         4,309,228 branches                 #  975.383 M/sec                    ( +-  0.00% )
             3,992 branch-misses            #    0.09% of all branches          ( +-  0.26% )

        0.004660164  seconds time elapsed  ( +-  0.15% )

Both stall values are much lower and the instructions per cycle value doubled.

Here's another testcase, one that fills the pipeline near-perfectly:

 $ perf stat ./fill_1b

 Performance counter stats for './fill_1b':

       1874.601174 task-clock               #    0.998 CPUs utilized          
                 1 context-switches         #    0.000 M/sec                  
                 0 CPU-migrations           #    0.000 M/sec                  
               107 page-faults              #    0.000 M/sec                  
     6,009,321,149 cycles                   #    3.206 GHz                    
       212,795,827 stalled-cycles           #    3.54% of all cycles are idle 
    18,007,646,574 instructions             #    3.00  insns per cycle        
                                            #    0.01  stalled cycles per insn
     1,001,527,311 branches                 #  534.262 M/sec                  
            16,988 branch-misses            #    0.00% of all branches        

        1.878558311  seconds time elapsed

Here too both counts are very low.

The next step is to provide the tools to further analyze why the CPU is not 
utilized perfectly. I have implemented some preliminary code for that too, 
using generic cache events:

 $ perf stat --repeat 10 --detailed ./array-bad 

 Performance counter stats for './array-bad' (10 runs):

         50.552646 task-clock               #    0.992 CPUs utilized            ( +-  0.04% )
                 0 context-switches         #    0.000 M/sec                    ( +-  0.00% )
                 0 CPU-migrations           #    0.000 M/sec                    ( +-  0.00% )
             1,877 page-faults              #    0.037 M/sec                    ( +-  0.01% )
       142,802,193 cycles                   #    2.825 GHz                      ( +-  0.18% )  (22.55%)
        88,622,411 stalled-cycles           #   62.06% of all cycles are idle   ( +-  0.22% )  (34.97%)
        45,381,755 instructions             #    0.32  insns per cycle        
                                            #    1.95  stalled cycles per insn  ( +-  0.11% )  (46.94%)
         7,725,207 branches                 #  152.815 M/sec                    ( +-  0.05% )  (58.44%)
            29,788 branch-misses            #    0.39% of all branches          ( +-  1.06% )  (69.46%)
         8,421,969 L1-dcache-loads          #  166.598 M/sec                    ( +-  0.37% )  (70.06%)
         7,868,389 L1-dcache-load-misses    #   93.43% of all L1-dcache hits    ( +-  0.13% )  (58.28%)
         4,553,490 LLC-loads                #   90.074 M/sec                    ( +-  0.31% )  (44.49%)
         1,764,277 LLC-load-misses          #   34.900 M/sec                    ( +-  0.21% )  (9.98%)

        0.050973462  seconds time elapsed  ( +-  0.05% )

The --detailed flag is what activates wider counting. The "93.43% of all 
L1-dcache hits" is a giveaway indicator that this particular workload is 
primarily data-access limited and that much of it escapes into RAM as well.

Is this the direction you'd like to see perf stat to move into? Any comments, 
suggestions?

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/