linux-kernel - Re: performance counter 20% error finding retired instruction count

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20090624151010.GA12799@elte.hu>
Date:	Wed, 24 Jun 2009 17:10:10 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Vince Weaver <vince@...ter.net>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Paul Mackerras <paulus@...ba.org>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: performance counter 20% error finding retired instruction count


* Vince Weaver <vince@...ter.net> wrote:

> Hello
>
> As an aside, is it time to set up a dedicated Performance Counters
> for Linux mailing list?   (Hereafter referred to as p10c7l to avoid
> confusion with the other implementations that have already taken
> all the good abbreviated forms of the concept).

('perfcounters' is the name of the subsystem/feature and it's 
unique.)

> [...]  If/when the infrastructure appears in a released kernel, 
> there's going to be a lot of chatter by people who use performance 
> counters and suddenly find they are stuck with a huge step 
> backwards in functionality.  And asking Fortran programmers to 
> provide kernel patches probably won't be a productive response.  
> But I digress.
>
> I was trying to get an exact retired instruction count from 
> p10c7l. I am using the test million.s, available here
>
>  ( http://www.csl.cornell.edu/~vince/projects/perf_counter/million.s )
>
> It should count exactly one million instructions.
>
> Tests with valgrind and qemu show that it does.
>
> Using perfmon2 on Pentium Pro, PII, PIII, P4, Athlon32, and Phenom
> all give the proper result:
>
> tobler:~% pfmon -e retired_instructions ./million
> 1000002 RETIRED_INSTRUCTIONS
>
>    ( it is 1,000,002 +/-2 because on most x86 architectures retired
>      instruction count includes any hardware interrupts that might
>      happen at the time.  It woud be a great feature if p10c7l
>      could add some way of gathering the per-process hardware
>      instruction count statistic to help quantify that).
>
> Yet with perf on the same Athlon32 machine (using
> kernel 2.6.30-03984-g45e3e19) gives:
>
> tobler:~%perf stat ./million
>
>  Performance counter stats for './million':
>
>        1.519366  task-clock-ticks     #       0.835 CPU utilization factor
>               3  context-switches     #       0.002 M/sec
>               0  CPU-migrations       #       0.000 M/sec
>              53  page-faults          #       0.035 M/sec
>         2483822  cycles               #    1634.775 M/sec
>         1240849  instructions         #     816.689 M/sec # 0.500 per cycle
>          612685  cache-references     #     403.250 M/sec
>            3564  cache-misses         #       2.346 M/sec
>
>  Wall-clock time elapsed:     1.819226 msecs
>
> Running multiple times gives:
>    1240849
>    1257312
>    1242313
>
> Which is a varying error of at least 20% which isn't even 
> consistent.  Is this because of sampling?  The documentation 
> doesn't really warn about this as far as I can tell.
>
> Thanks for any help resolving this problem

Thanks for the question! There's still gaps in the documentation so 
let me explain the basics here:

'perf stat' counts the true cost of executing the command in 
question, including the costs of:

   fork()ing the task
   exec()-ing it
   the ELF loader resolving dynamic symbols
   the app hitting various pagefaults that instantiate its pagetables

etc.

Those operations are pretty 'noisy' on a typical CPU, with lots of 
cache effects, so the noise you see is real.

You can eliminate much of the noise by only counting user-space 
instructions, as much of the command startup cost is in 
kernel-space.

Running your test-app that way can be done the following way:

 $ perf stat --repeat 10 -e 0:1:u ./million

 Performance counter stats for './million' (10 runs):

        1002106  instructions           ( +-   0.015% )

    0.000599029  seconds time elapsed.

( note the --repeat feature of perf stat - it does a loop of command 
  executions and observes the noise and displays it. )

Those ~2100 instructions are executed by your app: as the ELF 
dynamic loader starts up your test-app.

If you have some tool that reports less than that then that tool is 
not being truthful about the true overhead of your application.

Also note that applications that only execute 1 million instructions 
are very, very rare - a modern CPU can execute billions of 
instructions, per second, per core.

So i usually test a reference app that is more realistic, that 
executes 1 billion instructions:

 $ perf stat --repeat 10 -e 0:1:u ./loop_1b_instructions

 Performance counter stats for './loop_1b_instructions' (10 runs):

     1000079797  instructions           ( +-   0.000% )

    0.239947420  seconds time elapsed.

the noise there is very low. (despite 230 milliseconds still being a 
very short runtime)

Hope this helps - thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/