linux-kernel - Re: [PATCH v1 00/15] Introduce threaded trace streaming for basic perf record operation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201013162006.GJ1063281@kernel.org>
Date:   Tue, 13 Oct 2020 13:20:06 -0300
From:   Arnaldo Carvalho de Melo <acme@...nel.org>
To:     Alexey Budankov <alexey.budankov@...ux.intel.com>
Cc:     Jiri Olsa <jolsa@...hat.com>, Namhyung Kim <namhyung@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Adrian Hunter <adrian.hunter@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v1 00/15] Introduce threaded trace streaming for basic
 perf record operation

Em Mon, Oct 12, 2020 at 11:50:29AM +0300, Alexey Budankov escreveu:
> 
> Patch set provides threaded trace streaming for base perf record
> operation. Provided streaming mode (--threads) mitigates profiling
> data losses and resolves scalability issues of serial and asynchronous
> (--aio) trace streaming modes on multicore server systems. The patch
> set is based on the prototype [1], [2] and the most closely relates
> to mode 3) "mode that creates thread for every monitored memory map".
> 
> The threaded mode executes one-to-one mapping of trace streaming threads
> to mapped data buffers and streaming into per-CPU trace files located
> at data directory. The data buffers and threads are affined to NUMA
> nodes and monitored CPUs according to system topology. --cpu option
> can be used to specify exact CPUs to be monitored.
> 
> Basic analysis of data directories is provided for perf report mode.
> Raw dump (-D) and aggregated reports are available for data directories,
> still with no memory consumption optimizations. However data directories
> collected with --compression-level option enabled can be analyzed with
> little less memory because trace files are unmaped from tool process
> memory after loading collected data.
> 
> Provided streaming mode is available with Zstd compression/decompression
> (--compression-level) and handling of external commands (--control).
> AUX area tracing, related and derived modes like --snapshot or
> --aux-sample are not enabled. --switch-output, --switch-output-event, 
> --switch-max-files and --timestamp-filename options are not enabled.

Would be interesting to spell out what are the difficulties to have
those options working with this threaded mode, as I expect that once
this is all reviewed and tested we should switch to it by default, no?

- Arnaldo

> Threaded trace streaming is not enabled for pipe mode. Asynchronous
> (--aio) trace streaming and affinity (--affinity) modes are mutually
> exclusive to threaded streaming mode.
> 
> See testing results and validation examples below.
> 
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/
> 
> ---
> Alexey Budankov (15):
>   perf session: introduce trace file path to be shown in raw trace dump
>   perf report: output trace file name in raw trace dump
>   perf data: open data directory in read access mode
>   perf session: move reader object definition to header file
>   perf session: introduce decompressor into trace reader object
>   perf session: load data directory into tool process memory
>   perf record: introduce trace file, compressor and stats in mmap object
>   perf record: write trace data into mmap trace files
>   perf record: introduce thread specific objects for trace streaming
>   perf record: manage thread specific data array
>   perf evlist: introduce evlist__ctlfd_update() to update ctl fd status
>   perf record: introduce thread local variable for trace streaming
>   perf record: stop threads in the end of trace streaming
>   perf record: start threads in the beginning of trace streaming
>   perf record: introduce --threads command line option
> 
>  tools/perf/Documentation/perf-record.txt |   7 +
>  tools/perf/builtin-record.c              | 514 +++++++++++++++++++++--
>  tools/perf/util/data.c                   |   4 +
>  tools/perf/util/evlist.c                 |  16 +
>  tools/perf/util/evlist.h                 |   1 +
>  tools/perf/util/mmap.c                   |   6 +
>  tools/perf/util/mmap.h                   |   5 +
>  tools/perf/util/ordered-events.h         |   1 +
>  tools/perf/util/record.h                 |   1 +
>  tools/perf/util/session.c                | 150 ++++---
>  tools/perf/util/session.h                |  28 ++
>  tools/perf/util/tool.h                   |   3 +-
>  12 files changed, 635 insertions(+), 101 deletions(-)
> 
> ---
> Testing results:
> 
>  $ perf test
>  1: vmlinux symtab matches kallsyms                                 : Skip
>  2: Detect openat syscall event                                     : Ok
>  3: Detect openat syscall event on all cpus                         : Ok
>  4: Read samples using the mmap interface                           : Ok
>  5: Test data source output                                         : Ok
>  6: Parse event definition strings                                  : Ok
>  7: Simple expression parser                                        : Ok
>  8: PERF_RECORD_* events & perf_sample fields                       : Ok
>  9: Parse perf pmu format                                           : Ok
> 10: PMU events                                                      :
> 10.1: PMU event table sanity                                        : Ok
> 10.2: PMU event map aliases                                         : Ok
> 10.3: Parsing of PMU event table metrics                            : Skip (some metrics failed)
> 10.4: Parsing of PMU event table metrics with fake PMUs             : Ok
> 11: DSO data read                                                   : Ok
> 12: DSO data cache                                                  : Ok
> 13: DSO data reopen                                                 : FAILED!
> 14: Roundtrip evsel->name                                           : Ok
> 15: Parse sched tracepoints fields                                  : Ok
> 16: syscalls:sys_enter_openat event fields                          : Ok
> 17: Setup struct perf_event_attr                                    : Ok
> 18: Match and link multiple hists                                   : Ok
> 19: 'import perf' in python                                         : FAILED!
> 20: Breakpoint overflow signal handler                              : Ok
> 21: Breakpoint overflow sampling                                    : Ok
> 22: Breakpoint accounting                                           : Ok
> 23: Watchpoint                                                      :
> 23.1: Read Only Watchpoint                                          : Skip
> 23.2: Write Only Watchpoint                                         : Ok
> 23.3: Read / Write Watchpoint                                       : Ok
> 23.4: Modify Watchpoint                                             : Ok
> 24: Number of exit events of a simple workload                      : Ok
> 25: Software clock events period values                             : Ok
> 26: Object code reading                                             : Ok
> 27: Sample parsing                                                  : Ok
> 28: Use a dummy software event to keep tracking                     : Ok
> 29: Parse with no sample_id_all bit set                             : Ok
> 30: Filter hist entries                                             : Ok
> 31: Lookup mmap thread                                              : Ok
> 32: Share thread maps                                               : Ok
> 33: Sort output of hist entries                                     : Ok
> 34: Cumulate child hist entries                                     : Ok
> 35: Track with sched_switch                                         : Ok
> 36: Filter fds with revents mask in a fdarray                       : Ok
> 37: Add fd to a fdarray, making it autogrow                         : Ok
> 38: kmod_path__parse                                                : Ok
> 39: Thread map                                                      : Ok
> 40: LLVM search and compile                                         :
> 40.1: Basic BPF llvm compile                                        : Skip
> 40.2: kbuild searching                                              : Skip
> 40.3: Compile source for BPF prologue generation                    : Skip
> 40.4: Compile source for BPF relocation                             : Skip
> 41: Session topology                                                : Ok
> 42: BPF filter                                                      :
> 42.1: Basic BPF filtering                                           : Skip
> 42.2: BPF pinning                                                   : Skip
> 42.3: BPF prologue generation                                       : Skip
> 42.4: BPF relocation checker                                        : Skip
> 43: Synthesize thread map                                           : Ok
> 44: Remove thread map                                               : Ok
> 45: Synthesize cpu map                                              : Ok
> 46: Synthesize stat config                                          : Ok
> 47: Synthesize stat                                                 : Ok
> 48: Synthesize stat round                                           : Ok
> 49: Synthesize attr update                                          : Ok
> 50: Event times                                                     : Ok
> 51: Read backward ring buffer                                       : Ok
> 52: Print cpu map                                                   : Ok
> 53: Merge cpu map                                                   : Ok
> 54: Probe SDT events                                                : Ok
> 55: is_printable_array                                              : Ok
> 56: Print bitmap                                                    : Ok
> 57: perf hooks                                                      : Ok
> 58: builtin clang support                                           : Skip (not compiled in)
> 59: unit_number__scnprintf                                          : Ok
> 60: mem2node                                                        : Ok
> 61: time utils                                                      : Ok
> 62: Test jit_write_elf                                              : Ok
> 63: Test libpfm4 support                                            : Skip (not compiled in)
> 64: Test api io                                                     : Ok
> 65: maps__merge_in                                                  : Ok
> 66: Demangle Java                                                   : Ok
> 67: Parse and process metrics                                       : Ok
> 68: PE file support                                                 : Ok
> 69: Event expansion for cgroups                                     : Ok
> 70: x86 rdpmc                                                       : Ok
> 71: Convert perf time to TSC                                        : Ok
> 72: DWARF unwind                                                    : Ok
> 73: x86 instruction decoder - new instructions                      : Ok
> 74: Intel PT packet decoder                                         : Ok
> 75: x86 bp modify                                                   : Ok
> 76: Check open filename arg using perf trace + vfs_getname          : FAILED!
> 77: Add vfs_getname probe to get syscall args filenames             : FAILED!
> 78: probe libc's inet_pton & backtrace it with ping                 : Ok
> 79: Use vfs_getname probe to get syscall args filenames             : FAILED!
> 80: Zstd perf.data compression/decompression                        : Ok
> 81: Check Arm CoreSight trace data recording and synthesized samples: Skip
> 
> Validation:
> 
>   $ tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
>   Addr of buf1 = 0x7f8af24ae010
>   Offs of buf1 = 0x7f8af24ae180
>   Addr of buf2 = 0x7f8af04ad010
>   Offs of buf2 = 0x7f8af04ad1c0
>   Addr of buf3 = 0x7f8aee4ac010
>   Offs of buf3 = 0x7f8aee4ac100
>   Addr of buf4 = 0x7f8aec4ab010
>   Offs of buf4 = 0x7f8aec4ab140
>   Threads #: 8 Pthreads
>   Matrix size: 2048
>   Using multiply kernel: multiply1
>   Execution time = 24.579 seconds
>   [ perf record: Woken up 972 times to write data ]
>   [ perf record: Captured and wrote 11.280 MB perf.data (403 samples), compressed (original 66.670 MB, ratio is 5.926) ]
> 
>   $ perf report --header --stats
>   # ========
>   # captured on    : Wed Oct  7 19:07:50 2020
>   # header version : 1
>   # data offset    : 504
>   # data size      : 30056
>   # feat offset    : 30560
>   # hostname : nntvtune39
>   # os release : 5.8.8-100.fc31.x86_64
>   # perf version : 4.13.rc5.gdf19b1347b82
>   # arch : x86_64
>   # nrcpus online : 8
>   # nrcpus avail : 8
>   # cpudesc : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
>   # cpuid : GenuineIntel,6,94,3
>   # total memory : 16119484 kB
>   # cmdline : /root/abudanko/kernel/acme/tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench  /matrix/linux/matrix.gcc.g.O3 
>   # event : name = cycles, , id = { 54562, 54563, 54564, 54565, 54566, 54567, 54568, 54569 }, size = 120, { sample_period, sample_freq } = 4000, sample_type = IP|TID|TIME|ID|>
>   # event : name = instructions, , id = { 54570, 54571, 54572, 54573, 54574, 54575, 54576, 54577 }, size = 120, config = 0x1, { sample_period, sample_freq } = 4000, sample_ty>
>   # sibling sockets : 0-7
>   # sibling dies    : 0-7
>   # sibling threads : 0,4
>   # sibling threads : 1,5
>   # sibling threads : 2,6
>   # sibling threads : 3,7
>   # CPU 0: Core ID 0, Die ID 0, Socket ID 0
>   # CPU 1: Core ID 1, Die ID 0, Socket ID 0
>   # CPU 2: Core ID 2, Die ID 0, Socket ID 0
>   # CPU 3: Core ID 3, Die ID 0, Socket ID 0
>   # CPU 4: Core ID 0, Die ID 0, Socket ID 0
>   # CPU 5: Core ID 1, Die ID 0, Socket ID 0
>   # CPU 6: Core ID 2, Die ID 0, Socket ID 0
>   # CPU 7: Core ID 3, Die ID 0, Socket ID 0
>   # node0 meminfo  : total = 16119484 kB, free = 933088 kB
>   # node0 cpu list : 0-7
>   # pmu mappings: intel_pt = 8, software = 1, power = 19, uprobe = 7, uncore_imc = 11, cpu = 4, cstate_core = 17, uncore_cbox_2 = 14, breakpoint = 5, uncore_cbox_0 = 12, trac>
>   # CPU cache info:
>   #  L1 Data                 32K [0,4]
>   #  L1 Instruction          32K [0,4]
>   #  L1 Data                 32K [1,5]
>   #  L1 Instruction          32K [1,5]
>   #  L1 Data                 32K [1,5]
>   #  L1 Instruction          32K [1,5]
>   #  L1 Data                 32K [2,6]
>   #  L1 Instruction          32K [2,6]
>   #  L1 Data                 32K [3,7]
>   #  L1 Instruction          32K [3,7]
>   #  L2 Unified             256K [0,4]
>   #  L2 Unified             256K [1,5]
>   #  L2 Unified             256K [2,6]
>   #  L2 Unified             256K [3,7]
>   #  L3 Unified            8192K [0-7]
>   # time of first sample : 0.000000
>   # time of last sample : 0.000000
>   # sample duration :      0.000 ms
>   # memory nodes (nr 1, block size 0x8000000):
>   #    0 [16G]: 0-23,32-135
>   # directory data version : 1
>   # bpf_prog_info 835: bpf_prog_6deef7357e7b4530 addr 0xffffffffc07c5138 size 66
>   # bpf_prog_info 836: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0804558 size 66
>   # bpf_prog_info 837: bpf_prog_84efc2eecc454ca6 addr 0xffffffffc080804c size 373
>   # bpf_prog_info 838: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0817f50 size 66
>   # bpf_prog_info 839: bpf_prog_6deef7357e7b4530 addr 0xffffffffc082aee8 size 66
>   # bpf_prog_info 840: bpf_prog_5a2b06eab81b8f51 addr 0xffffffffc082c0fc size 1132
>   # bpf_prog_info 841: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0aac380 size 62
>   # bpf_prog_info 842: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0d4c970 size 62
>   # bpf_prog_info 843: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d4e8bc size 66
>   # bpf_prog_info 844: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d5050c size 66
>   # bpf_prog_info 845: bpf_prog_3ff695c0afc2579c addr 0xffffffffc0d56098 size 1552
>   # bpf_prog_info 846: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d58364 size 66
>   # bpf_prog_info 847: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d6ec00 size 66
>   # compressed : Zstd, level = 3, ratio = 6
>   # cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
>   # missing features: TRACING_DATA BUILD_ID BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID CLOCK_DATA 
>   # ========
>   #
>   Aggregated stats:
>            TOTAL events:    1457107
>             MMAP events:         98
>             LOST events:          0
>             COMM events:          2
>             EXIT events:          6
>         THROTTLE events:          0
>       UNTHROTTLE events:          0
>             FORK events:          8
>             READ events:          0
>           SAMPLE events:    1456388
>            MMAP2 events:          7
>              AUX events:          0
>     ITRACE_START events:          0
>     LOST_SAMPLES events:          0
>           SWITCH events:          0
>  SWITCH_CPU_WIDE events:          0
>       NAMESPACES events:          0
>          KSYMBOL events:         13
>        BPF_EVENT events:         13
>           CGROUP events:        166
>        TEXT_POKE events:          0
>             ATTR events:          0
>       EVENT_TYPE events:          0
>     TRACING_DATA events:          0
>         BUILD_ID events:          0
>   FINISHED_ROUND events:          0
>         ID_INDEX events:          0
>    AUXTRACE_INFO events:          0
>         AUXTRACE events:          0
>   AUXTRACE_ERROR events:          0
>       THREAD_MAP events:          1
>          CPU_MAP events:          1
>      STAT_CONFIG events:          0
>             STAT events:          0
>       STAT_ROUND events:          0
>     EVENT_UPDATE events:          0
>        TIME_CONV events:          1
>          FEATURE events:          0
>       COMPRESSED events:        403
> 
>   $ tools/perf/perf record --threads --compression-level=3 -g --call-graph dwarf,1024 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
>   Addr of buf1 = 0x7f15e8b89010
>   Offs of buf1 = 0x7f15e8b89180
>   Addr of buf2 = 0x7f15e6b88010
>   Offs of buf2 = 0x7f15e6b881c0
>   Addr of buf3 = 0x7f15e4b87010
>   Offs of buf3 = 0x7f15e4b87100
>   Addr of buf4 = 0x7f15e2b86010
>   Offs of buf4 = 0x7f15e2b86140
>   Threads #: 8 Pthreads
>   Matrix size: 2048
>   Using multiply kernel: multiply1
>   Execution time = 24.281 seconds
>   [ perf record: Woken up 6923 times to write data ]
>   [ perf record: Captured and wrote 34.933 MB perf.data (10414 samples), compressed (original 1728.795 MB, ratio is 49.530) ]
> 
>   $ perf report --stdio
>   # To display the perf.data header info, please use --header/--header-only options.
>   #
>   #
>   # Total Lost Samples: 0
>   #
>   # Samples: 708K of event 'cycles'
>   # Event count (approx.): 277718834163
>   #
>   # Children      Self  Command          Shared Object             Symbol                                                  
>   # ........  ........  ...............  ........................  ........................................................
>   #
>     99.97%     0.00%  matrix.gcc.g.O3  libc-2.30.so              [.] __GI___clone (inlined)
>             |
>             ---__GI___clone (inlined)
>                |          
>                 --99.97%--start_thread
>                           |          
>                            --99.97%--ThreadFunction
>                                      multiply1
>                                      |          
>                                       --0.96%--asm_sysvec_apic_timer_interrupt
>                                                 |          
>                                                  --0.70%--subflow_syn_recv_sock
>                                                            |          
>                                                             --0.60%--__sysvec_apic_timer_interrupt
>                                                                       |          
>                                                                        --0.57%--hrtimer_interrupt
> 
>     99.97%     0.00%  matrix.gcc.g.O3  libpthread-2.30.so        [.] start_thread
>             |
>             ---start_thread
>                |          
>                 --99.97%--ThreadFunction
>                           |          
>                            --99.97%--multiply1
>                                      |          
>                                       --0.96%--asm_sysvec_apic_timer_interrupt
>                                                 |          
>                                                  --0.70%--subflow_syn_recv_sock
>                                                            |          
>                                                             --0.60%--__sysvec_apic_timer_interrupt
>                                                                       |          
>                                                                        --0.57%--hrtimer_interrupt
> 
>     99.97%     0.00%  matrix.gcc.g.O3  matrix.gcc.g.O3           [.] ThreadFunction
>             |
>             ---ThreadFunction
>                |          
>                 --99.97%--multiply1
>                           |          
>                            --0.96%--asm_sysvec_apic_timer_interrupt
>                                      |          
>                                       --0.70%--subflow_syn_recv_sock
>                                                 |          
>                                                  --0.60%--__sysvec_apic_timer_interrupt
>                                                            |          
>                                                             --0.57%--hrtimer_interrupt
> 
>     99.97%    98.68%  matrix.gcc.g.O3  matrix.gcc.g.O3           [.] multiply1
>             |          
>             |--98.68%--__GI___clone (inlined)
>             |          start_thread
>             |          ThreadFunction
>             |          multiply1
>             |          
>              --1.28%--multiply1
>                        |          
>                         --0.96%--asm_sysvec_apic_timer_interrupt
>                                   |          
>                                    --0.70%--subflow_syn_recv_sock
>                                              |          
>                                               --0.60%--__sysvec_apic_timer_interrupt
>                                                         |          
>                                                          --0.57%--hrtimer_interrupt
> 
>      0.97%     0.05%  matrix.gcc.g.O3  [kernel.kallsyms]         [k] asm_sysvec_apic_timer_interrupt
>             |          
>              --0.91%--asm_sysvec_apic_timer_interrupt
>                        |          
>                         --0.70%--subflow_syn_recv_sock
> 
> 
> -- 
> 2.24.1
> 

-- 

- Arnaldo