Date: Tue, 3 Oct 2017 19:37:30 +0200
From: Ingo Molnar <mingo@...nel.org>
To: Arnaldo Carvalho de Melo <acme@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-perf-users@...r.kernel.org,
    Kan Liang <kan.liang@...el.com>,
    Adrian Hunter <adrian.hunter@...el.com>,
    Alexei Starovoitov <ast@...nel.org>,
    Andi Kleen <ak@...ux.intel.com>,
    He Kuang <hekuang@...wei.com>,
    Lukasz Odzioba <lukasz.odzioba@...el.com>,
    Namhyung Kim <namhyung@...nel.org>,
    Peter Zijlstra <peterz@...radead.org>,
    Wang Nan <wangnan0@...wei.com>,
    Arnaldo Carvalho de Melo <acme@...hat.com>
Subject: Re: [PATCH 6/8] perf top: Implement multithreading for perf_event__synthesize_threads

* Arnaldo Carvalho de Melo <acme@...nel.org> wrote:

> From: Kan Liang <kan.liang@...el.com>
>
> The proc files, which are sorted in alphabetical order, are evenly
> assigned to several synthesize threads to be processed in parallel.
>
> For 'perf top', the number of threads is hard-coded to the number of
> online CPUs. The following patch will introduce an option to set it.
>
> For other perf tools, the thread number is 1, because the processing
> function (e.g. process_synthesized_event) is not ready for
> multithreading.
>
> This patch series only supports event synthesis multithreading for
> 'perf top'. For other tools, it can be done separately later.

Just to give some quick feedback: this is really nice stuff!

Is anyone working on multi-threading 'perf record' (and the recording portion of
'perf top' perhaps)?

Especially with complex, high-frequency profiling there's a lot of SMP overhead
coming from a single recording thread. If there was a single thread per CPU, and
it truly only recorded the events from its own CPU, things would become a lot
more scalable.

For example, if we measure the current overhead of perf record of a (limited)
parallel kernel build:

  triton:~/tip> perf stat --no-inherit --pre "make clean >/dev/null 2>&1" perf record -F 10000 make -j kernel
  ...
  [ perf record: Captured and wrote 5.124 MB perf.data (108400 samples) ]

   Performance counter stats for 'perf record -F 10000 make -j kernel':

          183.582587      task-clock (msec)         #    0.039 CPUs utilized
               2,496      context-switches          #    0.014 M/sec
                 157      cpu-migrations            #    0.855 K/sec
               6,649      page-faults               #    0.036 M/sec
         817,478,151      cycles                    #    4.453 GHz
         416,641,913      stalled-cycles-frontend   #   50.97% frontend cycles idle
       1,018,336,301      instructions              #    1.25  insn per cycle
                                                    #    0.41  stalled cycles per insn
         217,255,137      branches                  # 1183.419 M/sec
           2,970,118      branch-misses             #    1.37% of all branches

         4.710378510 seconds time elapsed

That's 1,018,336,301 instructions just to record 108,400 samples, i.e. every
sample takes roughly 9,400 instructions to _record_. That's insanely high
overhead from what is in essence a tracing utility.

Even if I add "-B -N" to disable buildid generation (which is the worst
offender), it's still very high overhead:

  [ perf record: Captured and wrote 5.585 MB perf.data ]

   Performance counter stats for 'perf record -B -N -F 10000 make -j kernel':

           45.625321      task-clock (msec)         #    0.009 CPUs utilized
               2,950      context-switches          #    0.065 M/sec
                 204      cpu-migrations            #    0.004 M/sec
               1,992      page-faults               #    0.044 M/sec
         193,127,853      cycles                    #    4.233 GHz
         117,098,418      stalled-cycles-frontend   #   60.63% frontend cycles idle
         197,899,633      instructions              #    1.02  insn per cycle
                                                    #    0.59  stalled cycles per insn
          41,221,863      branches                  #  903.487 M/sec
             502,158      branch-misses             #    1.22% of all branches

         4.858962925 seconds time elapsed

... that's still 1,800+ instructions per event!

As a comparison, ftrace has a tracing overhead of less than 100 instructions per
event.

Thanks,

	Ingo
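
For readers who want to re-derive the per-sample figures quoted in the message, here is a minimal Python sketch that divides the `instructions` counter from each `perf stat` run by the number of recorded samples. The helper name `instructions_per_sample` is hypothetical. The sample count for the "-B -N" run is not shown in the message, so the sketch assumes it is about the same as in the first run (108400); that is an assumption, not a measured value.

```python
def instructions_per_sample(instructions: int, samples: int) -> float:
    """Average number of instructions perf record spent per recorded sample."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return instructions / samples


# Counters copied from the two perf stat runs quoted in the message.
SAMPLES = 108_400                          # reported for the first run only
DEFAULT_RUN_INSTRUCTIONS = 1_018_336_301   # perf record -F 10000
NO_BUILDID_RUN_INSTRUCTIONS = 197_899_633  # perf record -B -N -F 10000

default_cost = instructions_per_sample(DEFAULT_RUN_INSTRUCTIONS, SAMPLES)

# Assumption: the second run captured roughly as many samples as the first.
no_buildid_cost = instructions_per_sample(NO_BUILDID_RUN_INSTRUCTIONS, SAMPLES)

print(f"default: {default_cost:,.0f} instructions per sample")
print(f"-B -N:   {no_buildid_cost:,.0f} instructions per sample")
```

Under that assumption the two ratios come out in the low 9,000s and the low 1,800s respectively, consistent with the orders of magnitude given in the message and well above the sub-100 figure quoted for ftrace.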