Date:   Tue, 3 Oct 2017 19:37:30 +0200
From:   Ingo Molnar <mingo@...nel.org>
To:     Arnaldo Carvalho de Melo <acme@...nel.org>
Cc:     linux-kernel@...r.kernel.org, linux-perf-users@...r.kernel.org,
        Kan Liang <kan.liang@...el.com>,
        Adrian Hunter <adrian.hunter@...el.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Andi Kleen <ak@...ux.intel.com>, He Kuang <hekuang@...wei.com>,
        Lukasz Odzioba <lukasz.odzioba@...el.com>,
        Namhyung Kim <namhyung@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Wang Nan <wangnan0@...wei.com>,
        Arnaldo Carvalho de Melo <acme@...hat.com>
Subject: Re: [PATCH 6/8] perf top: Implement multithreading for
 perf_event__synthesize_threads


* Arnaldo Carvalho de Melo <acme@...nel.org> wrote:

> From: Kan Liang <kan.liang@...el.com>
> 
> The proc files, which are sorted in alphabetical order, are evenly
> assigned to several synthesize threads to be processed in parallel.
> 
> For 'perf top', the number of threads is hard-coded to the number of
> online CPUs. The following patch will introduce an option to set it.
> 
> For other perf tools, the thread number is 1, because the process
> function, e.g. process_synthesized_event, is not ready for
> multithreading.
> 
> This patch series only supports event synthesize multithreading for
> 'perf top'. For other tools, it can be done separately later.

Just to give some quick feedback: this is really nice stuff!

Is anyone working on multi-threading 'perf record' (and the recording portion of 
'perf top' perhaps)?

Especially with complex, high-frequency profiling there's a lot of SMP overhead 
coming from a single recording thread. If there was a single thread per CPU, and 
it truly only recorded the events from its own CPU, things would become a lot more 
scalable.
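
To make the "one recording thread per CPU" idea concrete, here is a minimal toy 
sketch in Python. It is not perf code: the CPU count, the buffer layout and every 
name in it are hypothetical, and the deques merely stand in for per-CPU ring 
buffers. The point is only the structure, where each recorder touches its own 
CPU's buffer and output and nothing else, so no lock is shared between them:

```python
import threading
from collections import deque

NUM_CPUS = 4  # hypothetical CPU count, chosen only for this sketch

# One private buffer per CPU, standing in for that CPU's event ring buffer.
per_cpu_buffers = [deque() for _ in range(NUM_CPUS)]
# One private output list per CPU, so recorders never share mutable state.
per_cpu_output = [[] for _ in range(NUM_CPUS)]


def fill_buffers(events_per_cpu):
    """Pretend each CPU produced a run of (cpu, sequence) events."""
    for cpu in range(NUM_CPUS):
        for seq in range(events_per_cpu):
            per_cpu_buffers[cpu].append((cpu, seq))


def record_cpu(cpu):
    """Drain only this CPU's buffer into this CPU's output."""
    buf = per_cpu_buffers[cpu]
    out = per_cpu_output[cpu]
    while buf:
        out.append(buf.popleft())


fill_buffers(1000)
threads = [threading.Thread(target=record_cpu, args=(cpu,))
           for cpu in range(NUM_CPUS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_recorded = sum(len(out) for out in per_cpu_output)
print(f"recorded {total_recorded} events across {NUM_CPUS} recorder threads")
```

A real implementation would additionally pin each thread to its CPU and read from 
the kernel's per-CPU buffers; the sketch leaves both out.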

For example, if we measure the current overhead of perf record of a (limited) 
parallel kernel build:

  triton:~/tip> perf stat --no-inherit --pre "make clean >/dev/null 2>&1" perf record -F 10000 make -j kernel
  ...
  [ perf record: Captured and wrote 5.124 MB perf.data (108400 samples) ]

 Performance counter stats for 'perf record -F 10000 make -j kernel':

        183.582587      task-clock (msec)         #    0.039 CPUs utilized          
             2,496      context-switches          #    0.014 M/sec                  
               157      cpu-migrations            #    0.855 K/sec                  
             6,649      page-faults               #    0.036 M/sec                  
       817,478,151      cycles                    #    4.453 GHz                    
       416,641,913      stalled-cycles-frontend   #   50.97% frontend cycles idle   
     1,018,336,301      instructions              #    1.25  insn per cycle         
                                                  #    0.41  stalled cycles per insn
       217,255,137      branches                  # 1183.419 M/sec                  
         2,970,118      branch-misses             #    1.37% of all branches        

       4.710378510 seconds time elapsed

That's 1,018,336,301 instructions just to record 108,400 samples, i.e. every sample 
takes roughly 9,400 instructions to _record_. That's insanely high overhead from 
what is in essence a tracing utility.
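
The per-sample cost follows directly from the counters in the 'perf stat' output 
above. As a quick check of the arithmetic, a short Python sketch (the variable 
names are mine, the figures are copied from that output):

```python
# Figures copied from the first 'perf stat' run above.
instructions = 1_018_336_301
cycles = 817_478_151
samples = 108_400

# Cost that 'perf record' itself pays for each sample it writes out.
instructions_per_sample = instructions / samples
cycles_per_sample = cycles / samples

print(f"{instructions_per_sample:.0f} instructions per sample")
print(f"{cycles_per_sample:.0f} cycles per sample")
```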


Even if I add "-B -N" to disable buildid generation (which is the worst offender), 
it's still very high overhead:

 [ perf record: Captured and wrote 5.585 MB perf.data ]

 Performance counter stats for 'perf record -B -N -F 10000 make -j kernel':

         45.625321      task-clock (msec)         #    0.009 CPUs utilized          
             2,950      context-switches          #    0.065 M/sec                  
               204      cpu-migrations            #    0.004 M/sec                  
             1,992      page-faults               #    0.044 M/sec                  
       193,127,853      cycles                    #    4.233 GHz                    
       117,098,418      stalled-cycles-frontend   #   60.63% frontend cycles idle   
       197,899,633      instructions              #    1.02  insn per cycle         
                                                  #    0.59  stalled cycles per insn
        41,221,863      branches                  #  903.487 M/sec                  
           502,158      branch-misses             #    1.22% of all branches        

       4.858962925 seconds time elapsed

... that's still 1,800+ instructions per event!

As a comparison, ftrace has a tracing overhead of less than 100 instructions per 
event.
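
To put the two numbers side by side, the same arithmetic for the "-B -N" run can 
be sketched in Python. Note the assumptions: the second run's output does not show 
a sample count, so the sketch reuses the 108,400 samples of the first run, and it 
takes the "less than 100 instructions" figure for ftrace as an upper bound. The 
resulting ratio is therefore only a rough lower bound:

```python
# Instruction count copied from the 'perf record -B -N' run above.
instructions_no_buildid = 197_899_633

# ASSUMPTION: the second run's output shows no sample count, so reuse the
# 108,400 samples reported by the first run.
samples_assumed = 108_400

# ftrace figure quoted in the text, treated as an upper bound per event.
ftrace_instructions_per_event = 100

perf_instructions_per_event = instructions_no_buildid / samples_assumed
ratio = perf_instructions_per_event / ftrace_instructions_per_event

print(f"perf record -B -N: {perf_instructions_per_event:.0f} instructions per event")
print(f"at least {ratio:.0f}x the ftrace per-event cost")
```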

Thanks,

	Ingo
