[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110424061645.GA12013@radium.snc4.facebook.com>
Date: Sat, 23 Apr 2011 23:16:45 -0700
From: Arun Sharma <asharma@...com>
To: Ingo Molnar <mingo@...e.hu>
Cc: arun@...rma-home.net, Stephane Eranian <eranian@...gle.com>,
Arnaldo Carvalho de Melo <acme@...radead.org>,
linux-kernel@...r.kernel.org, Andi Kleen <ak@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>,
Lin Ming <ming.m.lin@...el.com>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>, eranian@...il.com,
Arun Sharma <asharma@...com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote:
>
> The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
> cycles the CPU does nothing useful, because it is stalled on a
> cache-miss or some other condition.
Conceptually looks fine. I'd prefer a more precise name such as:
PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or
retirement stalls).
In the example below:
==> foo.c <==
foo()
{
}
bar()
{
}
==> test.c <==
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define FNV_PRIME_32 16777619
#define FNV_OFFSET_32 2166136261U
uint32_t hash1(const char *s)
{
uint32_t hash = FNV_OFFSET_32, i;
for(i = 0; i < 4; i++)
{
hash = hash ^ (s[i]); // xor next byte into the bottom of the hash
hash = hash * FNV_PRIME_32; // Multiply by prime number found to work well
}
return hash;
}
#define FNV_PRIME_WEAK_32 100
#define FNV_OFFSET_WEAK_32 200
uint32_t hash2(const char *s)
{
uint32_t hash = FNV_OFFSET_WEAK_32, i;
for(i = 0; i < 4; i++)
{
hash = hash ^ (s[i]); // xor next byte into the bottom of the hash
hash = hash * FNV_PRIME_WEAK_32; // Multiply by prime number found to work well
}
return hash;
}
int main()
{
int r = random();
while(1) {
r++;
#ifdef HARD
if (hash1((const char *) &r) & 0x500)
#else
if (hash2((const char *) &r) & 0x500)
#endif
foo();
else
bar();
}
}
==> Makefile <==
all:
gcc -O2 test.c foo.c -UHARD -o test.easy
gcc -O2 test.c foo.c -DHARD -o test.hard
# perf stat -e cycles,instructions ./test.hard
^C
Performance counter stats for './test.hard':
3,742,855,848 cycles
4,179,896,309 instructions # 1.117 IPC
1.754804730 seconds time elapsed
# perf stat -e cycles,instructions ./test.easy
^C
Performance counter stats for './test.easy':
3,932,887,528 cycles
8,994,416,316 instructions # 2.287 IPC
1.843832827 seconds time elapsed
i.e. fixing the branch mispredicts could result in
a nearly 2x speedup for the program.
Looking at:
# perf stat -e cycles,instructions,branch-misses,cache-misses,RESOURCE_STALLS:ANY ./test.hard
^C
Performance counter stats for './test.hard':
3,413,713,048 cycles (scaled from 69.93%)
3,772,611,782 instructions # 1.105 IPC (scaled from 80.01%)
51,113,340 branch-misses (scaled from 80.01%)
12,370 cache-misses (scaled from 80.02%)
26,656,983 RESOURCE_STALLS:ANY (scaled from 69.99%)
1.626595305 seconds time elapsed
it's hard to spot the opporunity. On the other hand:
# ./analyze.py
Percent idle: 27%
Retirement Stalls: 82%
Backend Stalls: 0%
Frontend Stalls: 62%
Instruction Starvation: 62%
icache stalls: 0%
does give me a signal about where to look. The script below is
a quick and dirty hack. I haven't really validated it with
many workloads. I'm posting it here anyway hoping that it'd
result in better kernel support for these types of analyses.
Even if we cover this with various generic PERF_COUNT_*STALL events,
we'll still have a need for other events:
* Things that give info about instruction mixes.
Ratio of {loads, stores, floating point, branches, conditional branches}
to total instructions.
* Activity related to micro architecture specific caches
People using -funroll-loops may have a significant performance opportunity.
But it's hard to spot bottlenecks in the instruction decoder.
* Monitoring traffic on Hypertransport/QPI links
Like you observe, most people will not look at these events, so
focusing on getting the common events right makes sense. But I
still like access to all events (either via a mapping file or
a library such as libpfm4). Hiding them in "perf list" sounds
like a reasonable way of keeping complexity out.
-Arun
PS: branch-misses:pp was spot on for the example above.
#!/usr/bin/env python
from optparse import OptionParser
from itertools import izip, chain, repeat
from subprocess import Popen, PIPE
import re, struct
def grouper(n, iterable, padvalue=None):
"grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)
counter_re = re.compile('\s+(?P<count>\d+)\s+(?P<event>\S+)')
def sample(events):
cmd = 'perf stat --no-big-num -a'
ncounters = 4
groups = grouper(ncounters, events)
for g in groups:
# filter padding
g = [ e for e in g if e ]
cmd += ' -e ' + ','.join(g)
cmd += ' -- sleep ' + str(options.time)
process = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
out, err = process.communicate()
ret = process.poll()
if ret: raise "Perf failed: " + err
ret = {}
for line in err.split('\n'):
m = counter_re.match(line)
if not m: continue
ret[m.group('event')] = long(m.group('count'))
return ret
def measure_cycles():
# disable C-states
f = open("/dev/cpu_dma_latency", "wb")
f.write(struct.pack("i", 0))
f.flush()
saved = options.time
options.time = 1 # one sec is sufficient to measure clock
cycles = sample(["cycles"])['cycles']
cycles /= options.time
f.close()
options.time = saved
return cycles
if __name__ == '__main__':
parser = OptionParser()
parser.add_option("-t", "--time", dest="time", default=1,
help="How long to sample events")
parser.add_option("-q", "--quiet",
action="store_false", dest="verbose", default=True,
help="don't print status messages to stdout")
(options, args) = parser.parse_args()
cycles_per_sec = measure_cycles()
c = sample(["cycles", "instructions", "UOPS_ISSUED:ANY:c=1", "UOPS_ISSUED:ANY:c=1:t=1",
"RESOURCE_STALLS:ANY", "UOPS_RETIRED:ANY:c=1:t=1",
"UOPS_EXECUTED:PORT015:t=1", "UOPS_EXECUTED:PORT234_CORE",
"UOPS_ISSUED:ANY:t=1", "UOPS_ISSUED:FUSED:t=1", "UOPS_RETIRED:ANY:t=1",
"L1I:CYCLES_STALLED"])
cycles = c["cycles"] * 1.0
cycles_no_uops_issued = cycles - c["UOPS_ISSUED:ANY:c=1:t=1"]
cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
backend_stall_cycles = c["RESOURCE_STALLS:ANY"]
icache_stall_cycles = c["L1I:CYCLES_STALLED"]
# Cycle stall accounting
print "Percent idle: %d%%" % ((1 - cycles/(int(options.time) * cycles_per_sec)) * 100)
print "\tRetirement Stalls: %d%%" % ((cycles_no_uops_retired / cycles) * 100)
print "\tBackend Stalls: %d%%" % ((backend_stall_cycles / cycles) * 100)
print "\tFrontend Stalls: %d%%" % ((cycles_no_uops_issued / cycles) * 100)
print "\tInstruction Starvation: %d%%" % (((cycles_no_uops_issued - backend_stall_cycles)/cycles) * 100)
print "\ticache stalls: %d%%" % ((icache_stall_cycles/cycles) * 100)
# Wasted work
uops_executed = c["UOPS_EXECUTED:PORT015:t=1"] + c["UOPS_EXECUTED:PORT234_CORE"]
uops_retired = c["UOPS_RETIRED:ANY:t=1"]
uops_issued = c["UOPS_ISSUED:ANY:t=1"] + c["UOPS_ISSUED:FUSED:t=1"]
print "\tPercentage useless uops: %d%%" % ((uops_executed - uops_retired) * 100.0/uops_retired)
print "\tPercentage useless uops issued: %d%%" % ((uops_issued - uops_retired) * 100.0/uops_retired)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists