linux-kernel - [PATCH RFC V4 0/6] perf top optimization

Message-Id: <1506696477-146932-1-git-send-email-kan.liang@intel.com>
Date:   Fri, 29 Sep 2017 07:47:51 -0700
From:   kan.liang@...el.com
To:     acme@...nel.org, peterz@...radead.org, mingo@...hat.com,
        linux-kernel@...r.kernel.org
Cc:     jolsa@...nel.org, namhyung@...nel.org, adrian.hunter@...el.com,
        lukasz.odzioba@...el.com, wangnan0@...wei.com, hekuang@...wei.com,
        ast@...nel.org, ak@...ux.intel.com, Kan Liang <kan.liang@...el.com>
Subject: [PATCH RFC V4 0/6] perf top optimization

From: Kan Liang <kan.liang@...el.com>

This patch series aims to fix a severe performance issue on
Knights Landing/Mill when monitoring a heavily loaded system.
perf top takes a few minutes to show results, which is
unacceptable.
With the patch series applied, the latency drops to
several seconds.

machine__synthesize_threads and perf_top__mmap_read account for most
of the perf top time (> 99%).
Patches 1-4 optimize machine__synthesize_threads.
Patches 5-6 optimize perf_top__mmap_read.

Optimization for machine__synthesize_threads
  - Multithread the whole process.
  - The number of threads defaults to the maximum number of online
    CPUs. Users can change it through the new option.
  - Introduce a hashtable for machine threads to reduce lock
    contention.
  - The optimization can also benefit other platforms and other
    perf tools, such as perf record. This patch series does not
    optimize the other tools; that can be done separately later.
  - With this optimization applied, there is a 1.56x speedup on
    Knights Mill under heavy workload.
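
The idea in the list above can be sketched roughly as follows. This is
not the code from the patches; it is a minimal standalone illustration,
assuming a fixed-size hashtable with one mutex per bucket and a flat
array of thread IDs to insert. All names here (NR_BUCKETS, struct
bucket, table_insert, insert_parallel, default_thread_nr) are
hypothetical and do not come from tools/perf.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_BUCKETS 256 /* hypothetical bucket count */

struct entry { int tid; struct entry *next; };

/* One lock per bucket, so workers contend only on hash collisions. */
struct bucket {
	pthread_mutex_t lock;
	struct entry *head;
	size_t nr;
};

struct worker_arg {
	struct bucket *table;
	const int *tids;
	size_t start, end; /* half-open range [start, end) */
};

void table_init(struct bucket *table)
{
	for (int i = 0; i < NR_BUCKETS; i++) {
		pthread_mutex_init(&table[i].lock, NULL);
		table[i].head = NULL;
		table[i].nr = 0;
	}
}

/* Insert one tid, locking only the bucket it hashes to. */
int table_insert(struct bucket *table, int tid)
{
	struct bucket *b = &table[(unsigned int)tid % NR_BUCKETS];
	struct entry *e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->tid = tid;
	pthread_mutex_lock(&b->lock);
	e->next = b->head;
	b->head = e;
	b->nr++;
	pthread_mutex_unlock(&b->lock);
	return 0;
}

/* Total entries; call only after all workers have been joined. */
size_t table_count(const struct bucket *table)
{
	size_t total = 0;
	for (int i = 0; i < NR_BUCKETS; i++)
		total += table[i].nr;
	return total;
}

static void *worker(void *p)
{
	struct worker_arg *a = p;
	for (size_t i = a->start; i < a->end; i++)
		table_insert(a->table, a->tids[i]);
	return NULL;
}

/* Default worker count: one per online CPU, at least one. */
int default_thread_nr(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);
	return n > 0 ? (int)n : 1;
}

/* Split tids[0..nr) into thread_nr chunks and insert them in parallel. */
int insert_parallel(struct bucket *table, const int *tids, size_t nr, int thread_nr)
{
	assert(thread_nr > 0);
	size_t n = (size_t)thread_nr, started = 0;
	size_t chunk = (nr + n - 1) / n;
	pthread_t *threads = calloc(n, sizeof(*threads));
	struct worker_arg *args = calloc(n, sizeof(*args));
	for (size_t i = 0; threads && args && i < n; i++) {
		size_t start = i * chunk < nr ? i * chunk : nr;
		size_t end = start + chunk < nr ? start + chunk : nr;
		args[i] = (struct worker_arg){ table, tids, start, end };
		if (pthread_create(&threads[i], NULL, worker, &args[i]))
			break;
		started++;
	}
	for (size_t i = 0; i < started; i++)
		pthread_join(threads[i], NULL);
	free(threads);
	free(args);
	return started == n ? 0 : -1;
}
```

A caller would run table_init() once and then call
insert_parallel(table, tids, nr, default_thread_nr()). Because each
insert locks only one bucket, two workers block each other only when
their IDs hash to the same bucket. The sketch never frees the entries
and ignores allocation failures inside the workers; real code would
need to handle both.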

Optimization for perf_top__mmap_read
  - Switch to backward overwrite mode.
    In non-overwrite mode, perf top tries to read everything in the ring
    buffer and does not check whether the data is messed up. When many
    samples are delivered in a short time, the processing time can be
    very long.
    Given the real-time requirement of perf top, it should switch
    to backward overwrite mode.
  - With this optimization applied, there is an 8.98x speedup on
    Knights Mill under heavy workload.
  - However, the latency of perf_top__mmap_read is still higher than the
    default perf top refresh time (2s) on Knights Mill under heavy workload.
    A check is introduced to give some hints on how to reduce the overhead.
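
The latency check in the last item could look something like the sketch
below. It is an illustration only, not the code from patch 6: it times
one read pass with CLOCK_MONOTONIC and prints a hint when the pass takes
longer than the refresh interval. The names elapsed_sec, read_too_slow
and timed_read, the read_pass callback and the wording of the hint are
all assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Seconds elapsed between two CLOCK_MONOTONIC timestamps. */
double elapsed_sec(const struct timespec *start, const struct timespec *end)
{
	return (double)(end->tv_sec - start->tv_sec) +
	       (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* True when one read pass cost more than one refresh interval. */
bool read_too_slow(double read_sec, double refresh_sec)
{
	return read_sec > refresh_sec;
}

/*
 * Time one read pass and print a hint if it overran the refresh interval.
 * read_pass and ctx are hypothetical stand-ins for the real reader.
 */
bool timed_read(void (*read_pass)(void *), void *ctx, double refresh_sec)
{
	struct timespec start, end;
	double cost;

	assert(read_pass && refresh_sec > 0.0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	read_pass(ctx);
	clock_gettime(CLOCK_MONOTONIC, &end);

	cost = elapsed_sec(&start, &end);
	if (!read_too_slow(cost, refresh_sec))
		return false;
	/* Illustrative hint text; the real message may differ. */
	fprintf(stderr,
		"read took %.2fs, longer than the %.2fs refresh interval; "
		"consider reducing the sampling load\n", cost, refresh_sec);
	return true;
}
```

With the default 2s refresh time, a caller would pass 2.0 as
refresh_sec. A monotonic clock is used so that wall-clock adjustments
cannot distort the measured cost.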

The source code is also available at
https://github.com/kliang2/perf.git perf_top_opt

Here are the perf top latency test results on Knights Mill and a
Skylake server.

The heavy workload is a Linux kernel build, started with
"sudo nice make -j$(grep -c '^processor' /proc/cpuinfo)"
followed by "sudo perf top".

The latency is the time between launching perf top and the first
profiling result being shown.

- Latency on Knights Mill (272 CPUs)

Original(s)     With patch(s)   Speedup
272.68          40.89           6.67x

- Latency on Skylake server (192 CPUs)

Original(s)     With patch(s)   Speedup
12.28           2.05            5.99x
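
For reference, the Speedup column is the original latency divided by
the patched latency. A trivial helper (the name speedup is hypothetical,
used only for illustration) makes the arithmetic explicit:

```c
#include <assert.h>

/* Speedup = original latency / patched latency (both in seconds). */
double speedup(double original_s, double patched_s)
{
	assert(original_s >= 0.0 && patched_s > 0.0);
	return original_s / patched_s;
}
```

speedup(272.68, 40.89) is about 6.67 and speedup(12.28, 2.05) is about
5.99, matching the two tables above.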

Changes since V3:
 - Switch to backward overwrite mode (jirka)
   Backward mode avoids some problems found in forward mode,
   but performance on Knights Mill drops compared with V3.
   Introduce a new patch to check the latency of perf_top__mmap_read.
 - Handle thread_nr = 1 specially (jirka)

Changes since V2:
 - Patches 1 and 2 of V2 (hashtable and scandir) have been merged.
 - Patches 3, 4 and 7 of V2 are dropped because the optimization code
   doesn't touch that code, so the protection is not needed. (Arnaldo & jirka)
 - Use mutex wrappers for multithread-only locks. (jirka)
 - Move struct synthesize_threads_arg to event.c (jirka)

Changes since V1:
 - Patch 1: machine threads and hashtable related renaming (Arnaldo)
 - Patch 6: use a smaller locked section for comm_str__put
   add a locked wrapper for comm_str__findnew              (Arnaldo)

Kan Liang (6):
  perf tools: lock to protect namespaces and comm list
  perf tools: lock to protect comm_str rb tree
  perf top: implement multithreading for perf_event__synthesize_threads
  perf top: add option to set the number of thread for event synthesize
  perf top: switch to backward overwrite mode
  perf top: check the cost of perf_top__mmap_read

 tools/perf/Documentation/perf-top.txt |   3 +
 tools/perf/builtin-kvm.c              |   3 +-
 tools/perf/builtin-record.c           |   2 +-
 tools/perf/builtin-top.c              |  43 ++++++---
 tools/perf/builtin-trace.c            |   2 +-
 tools/perf/tests/mmap-thread-lookup.c |   2 +-
 tools/perf/ui/browsers/hists.c        |  12 ++-
 tools/perf/util/comm.c                |  18 +++-
 tools/perf/util/event.c               | 163 +++++++++++++++++++++++++++-------
 tools/perf/util/event.h               |   3 +-
 tools/perf/util/machine.c             |   8 +-
 tools/perf/util/machine.h             |   9 +-
 tools/perf/util/thread.c              |  53 +++++++++--
 tools/perf/util/thread.h              |   3 +
 tools/perf/util/top.h                 |   1 +
 15 files changed, 262 insertions(+), 63 deletions(-)

-- 
2.5.5