Date:   Fri, 21 Sep 2018 15:15:56 +0300
From:   Alexey Budankov <alexey.budankov@...ux.intel.com>
To:     Jiri Olsa <jolsa@...hat.com>
Cc:     Jiri Olsa <jolsa@...nel.org>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...nel.org>,
        Namhyung Kim <namhyung@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Peter Zijlstra <a.p.zijlstra@...llo.nl>,
        Andi Kleen <andi@...stfloor.org>
Subject: Re: [RFCv2 00/48] perf tools: Add threads to record command

Hello Jiri,

On 21.09.2018 9:13, Alexey Budankov wrote:
> Hello Jiri,
> 
> On 14.09.2018 12:37, Alexey Budankov wrote:
>> On 14.09.2018 11:28, Jiri Olsa wrote:
>>> On Fri, Sep 14, 2018 at 10:26:53AM +0200, Jiri Olsa wrote:
>>>
>>> SNIP
>>>
>>>>>> The threaded monitoring currently can't monitor backward maps
>>>>>> and there are probably more limitations which I haven't spotted
>>>>>> yet.
>>>>>>
>>>>>> So far I tested on a laptop:
>>>>>>   http://people.redhat.com/~jolsa/record_threads/test-4CPU.txt
>>>>>>
>>>>>> and one bigger server:
>>>>>>   http://people.redhat.com/~jolsa/record_threads/test-208CPU.txt
>>>>>>
>>>>>> I can see a decrease in recorded LOST events, but both the benchmark
>>>>>> and the monitoring must be carefully configured with respect to
>>>>>> (a flag-level sketch follows the list):
>>>>>>   - number of events (frequency)
>>>>>>   - size of the memory maps
>>>>>>   - size of events (callchains)
>>>>>>   - final perf.data size
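>>>>>>
>>>>>> These roughly map to perf record knobs (an illustrative sketch,
>>>>>> not a tested configuration):
>>>>>>
>>>>>>   # -F: sampling frequency; -m: memory map size in pages;
>>>>>>   # --call-graph dwarf,N: per-sample user stack dump size in bytes;
>>>>>>   # -o: final trace file on disk
>>>>>>   perf record -F 1000 -m 512 --call-graph dwarf,1024 -o perf.data ...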
>>>>>>
>>>>>> It's also available in:
>>>>>>   git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>>>>>>   perf/record_threads
>>>>>>
>>>>>> thoughts? ;-) thanks
>>>>>> jirka
>>>>>
>>>>> It is preferable to split this into smaller pieces, each bringing
>>>>> an improvement proven by metric numbers and ready for merging
>>>>> upstream. Do we have more metrics than the data loss from the
>>>>> trace AIO patches?
>>>>
>>>> Well, the primary focus is to get more events in,
>>>> so the LOST metric is the main one.
>>>
>>> Actually, I was hoping you could run it through the same
>>> tests as you do for the AIO code on some huge server?
>>
>> Yeah, I will, but it takes some time.
> 
> Here it is:
> 
> Hardware:
> cat /proc/cpuinfo
> processor	: 271
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 133
> model name	: Intel(R) Xeon Phi(TM) CPU 7285 @ 1.30GHz
> stepping	: 0
> microcode	: 0xe
> cpu MHz		: 1064.235
> cache size	: 1024 KB
> physical id	: 0
> siblings	: 272
> core id		: 73
> cpu cores	: 68
> apicid		: 295
> initial apicid	: 295
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 13
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm ida arat pln pts avx512_vpopcntdq avx512_4vnniw avx512_4fmaps
> bugs		: cpu_meltdown spectre_v1 spectre_v2
> bogomips	: 2594.07
> clflush size	: 64
> cache_alignment	: 64
> address sizes	: 46 bits physical, 48 bits virtual
> power management:
> 
> uname -a
> Linux nntpat98-196 4.18.0-rc7+ #2 SMP Thu Sep 6 13:24:37 MSK 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> cat /proc/sys/kernel/perf_event_paranoid
> 0
> 
> cat /proc/sys/kernel/perf_event_mlock_kb 
> 516
> 
> cat /proc/sys/kernel/perf_event_max_sample_rate 
> 3000
> 
> cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.5 (Maipo)
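> 
> For reference, the three sysctls above can be set on a comparable box
> like this (a sketch; the values are the ones used here, not general
> recommendations):
> 
> sudo sysctl -w kernel.perf_event_paranoid=0
> sudo sysctl -w kernel.perf_event_mlock_kb=516
> sudo sysctl -w kernel.perf_event_max_sample_rate=3000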
> 
> Metrics:
> runtime overhead (x) : elapsed_time_under_profiling / elapsed_time
> data loss (%)        : paused_time / elapsed_time_under_profiling
> LOST events          : count from perf report --stats
> SAMPLE events        : count from perf report --stats
> perf.data size (GiB) : size of the trace file on disk
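> 
> As a worked example (a sketch; the 323.54 s / 7.12 s timings are from
> the T=272, P=0.1 run below, and perf.data is the resulting trace file):
> 
> # overhead factor = elapsed time under profiling / unprofiled elapsed time
> echo "scale=1; 323.54 / 7.12" | bc   # -> 45.4, reported as ~45x
> # LOST and SAMPLE counts, as used in the tables below
> perf report --stats -i perf.data | grep -E 'LOST|SAMPLE'
> # trace file size on disk
> ls -lh perf.data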
> 
> Events:
> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
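> 
> The sampling periods are hex event counts; they decode to round decimal
> values (shell arithmetic, shown only for illustration):
> 
> printf '%d\n' 0xaae61 0x11171 0x30d40 0x4e20
> # 700001 70001 200000 20000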
> 
> =================================================
> 
> Command:
> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record --threads=T \
> 	-a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>         -e cpu/period=P,event=0x3c/Duk,\
>            cpu/period=P,umask=0x3/Duk,\
>            cpu/period=P,event=0xc0/Duk,\
>            cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
>            cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
>            cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
>          --clockid=monotonic_raw -- ./matrix.(icc|gcc)
> 
> Workload: matrix multiplication in 256 threads
> 
> /usr/bin/time ./matrix.icc
> Addr of buf1 = 0x7ff9faa73010
> Offs of buf1 = 0x7ff9faa73180
> Addr of buf2 = 0x7ff9f8a72010
> Offs of buf2 = 0x7ff9f8a721c0
> Addr of buf3 = 0x7ff9f6a71010
> Offs of buf3 = 0x7ff9f6a71100
> Addr of buf4 = 0x7ff9f4a70010
> Offs of buf4 = 0x7ff9f4a70140
> Threads #: 256 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Freq = 0.997720 GHz
> Execution time = 9.061 seconds
> 1639.55user 6.59system 0:07.12elapsed 23094%CPU (0avgtext+0avgdata 100448maxresident)k
> 96inputs+0outputs (1major+33839minor)pagefaults 0swaps
> 
> T : 272
>         P (period, ms)       : 0.1
>         runtime overhead (x) : 45x ~ 323.54 / 7.12
>         data loss (%)        : 96
>         LOST events          : 323662
>         SAMPLE events        : 31885479
>         perf.data size (GiB) : 42
> 
>         P (period, ms)       : 0.25
>         runtime overhead (x) : 25x ~ 180.76 / 7.12
>         data loss (%)        : 69
>         LOST events          : 10636
>         SAMPLE events        : 18692998
>         perf.data size (GiB) : 23.5
> 
>         P (period, ms)       : 0.35
>         runtime overhead (x) : 16x ~ 119.49 / 7.12
>         data loss (%)        : 1
>         LOST events          : 6
>         SAMPLE events        : 11178524
>         perf.data size (GiB) : 14
> 
> T : 128
>         P (period, ms)       : 0.35
>         runtime overhead (x) : 15x ~ 111.98 / 7.12
>         data loss (%)        : 62
>         LOST events          : 2825
>         SAMPLE events        : 11267247
>         perf.data size (GiB) : 15
> 
> T : 64
>         P (period, ms)       : 0.35
>         runtime overhead (x) : 14x ~ 101.55 / 7.12
>         data loss (%)        : 67
>         LOST events          : 5155
>         SAMPLE events        : 10966297
>         perf.data size (GiB) : 13.7
> 
> Workload: matrix multiplication in 128 threads
> 
> /usr/bin/time ./matrix.gcc
> Addr of buf1 = 0x7f072e630010
> Offs of buf1 = 0x7f072e630180
> Addr of buf2 = 0x7f072c62f010
> Offs of buf2 = 0x7f072c62f1c0
> Addr of buf3 = 0x7f072a62e010
> Offs of buf3 = 0x7f072a62e100
> Addr of buf4 = 0x7f072862d010
> Offs of buf4 = 0x7f072862d140
> Threads #: 128 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 6.639 seconds
> 767.03user 11.17system 0:06.81elapsed 11424%CPU (0avgtext+0avgdata 100756maxresident)k
> 88inputs+0outputs (0major+139898minor)pagefaults 0swaps
> 
> T : 272
>         P (period, ms)       : 0.1
>         runtime overhead (x) : 29x ~ 198.81 / 6.81
>         data loss (%)        : 21
>         LOST events          : 2502
>         SAMPLE events        : 22481062
>         perf.data size (GiB) : 27.6
> 
>         P (period, ms)       : 0.25
>         runtime overhead (x) : 13x ~ 88.47 / 6.81
>         data loss (%)        : 0
>         LOST events          : 0
>         SAMPLE events        : 9572787
>         perf.data size (GiB) : 11.3
> 
>         P (period, ms)       : 0.35
>         runtime overhead (x) : 10x ~ 67.11 / 6.81
>         data loss (%)        : 1
>         LOST events          : 137
>         SAMPLE events        : 6985930
>         perf.data size (GiB) : 8
> 
> T : 128
>         P (period, ms)       : 0.35
>         runtime overhead (x) : 9.5x ~ 64.33 / 6.81
>         data loss (%)        : 1
>         LOST events          : 3
>         SAMPLE events        : 6666903
>         perf.data size (GiB) : 7.8
> 
> T : 64
>         P (period, ms)       : 0.25
>         runtime overhead (x) : 17x ~ 114.27 / 6.81
>         data loss (%)        : 2
>         LOST events          : 52
>         SAMPLE events        : 12643645
>         perf.data size (GiB) : 15.5
> 
>         P (period, ms)       : 0.35
>         runtime overhead (x) : 10x ~ 68.60 / 6.81
>         data loss (%)        : 1
>         LOST events          : 93
>         SAMPLE events        : 7164368
>         perf.data size (GiB) : 8.5

And here are the results for the AIO and serial modes:

Command:
/usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.aio record --aio=N \
	-a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
        -e cpu/period=P,event=0x3c/Duk,\
           cpu/period=P,umask=0x3/Duk,\
           cpu/period=P,event=0xc0/Duk,\
           cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
           cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
           cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
         --clockid=monotonic_raw -- ./matrix.(icc|gcc)
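
The N and P values in the tables below come from rerunning this command
with different settings; a minimal driver loop might look like this
(a sketch only; the event list is elided and the paths are the ones from
this setup):

for N in 64 128 272 512; do
    /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.aio record \
        --aio=$N -a -N -B -T -R --call-graph dwarf,1024 -- ./matrix.gcc
    perf report --stats -i perf.data | grep -E 'LOST|SAMPLE'
done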

Workload: matrix multiplication in 256 threads

N : 512
        P (period, ms)       : 2.5
        runtime overhead (x) : 2.7x ~ 19.21 / 7.12
        data loss (%)        : 42
        LOST events          : 1600
        SAMPLE events        : 1235928
        perf.data size (GiB) : 1.5

N : 272
        P (period, ms)       : 1.5
        runtime overhead (x) : 2.5x ~ 18.09 / 7.12
        data loss (%)        : 89
        LOST events          : 3457
        SAMPLE events        : 1222143
        perf.data size (GiB) : 1.5

        P (period, ms)       : 2
        runtime overhead (x) : 2.5x ~ 17.93 / 7.12
        data loss (%)        : 65
        LOST events          : 2496
        SAMPLE events        : 1240754
        perf.data size (GiB) : 1.5

        P (period, ms)       : 2.5
        runtime overhead (x) : 2.5x ~ 17.87 / 7.12
        data loss (%)        : 44
        LOST events          : 1621
        SAMPLE events        : 1221949
        perf.data size (GiB) : 1.5

        P (period, ms)       : 3
        runtime overhead (x) : 2.5x ~ 18.43 / 7.12
        data loss (%)        : 12
        LOST events          : 350
        SAMPLE events        : 1117972
        perf.data size (GiB) : 1.3

N : 128
        P (period, ms)       : 3
        runtime overhead (x) : 2.4x ~ 17.08 / 7.12
        data loss (%)        : 11
        LOST events          : 335
        SAMPLE events        : 1116832
        perf.data size (GiB) : 1.3

N : 64
        P (period, ms)       : 3
        runtime overhead (x) : 2.2x ~ 16.03 / 7.12
        data loss (%)        : 11
        LOST events          : 329
        SAMPLE events        : 1108205
        perf.data size (GiB) : 1.3
 
Workload: matrix multiplication in 128 threads

N : 512
        P (period, ms)       : 1
        runtime overhead (x) : 3.5x ~ 23.72 / 6.81
        data loss (%)        : 18
        LOST events          : 1043
        SAMPLE events        : 2015306
        perf.data size (GiB) : 2.3

N : 272
        P (period, ms)       : 0.5
        runtime overhead (x) : 3x ~ 22.72 / 6.81
        data loss (%)        : 90
        LOST events          : 5842
        SAMPLE events        : 2205937
        perf.data size (GiB) : 2.5

        P (period, ms)       : 1
        runtime overhead (x) : 3x ~ 22.79 / 6.81
        data loss (%)        : 11
        LOST events          : 481
        SAMPLE events        : 2017099
        perf.data size (GiB) : 2.5

        P (period, ms)       : 1.5
        runtime overhead (x) : 3x ~ 19.93 / 6.81
        data loss (%)        : 5
        LOST events          : 190
        SAMPLE events        : 1308692
        perf.data size (GiB) : 1.5

        P (period, ms)       : 2
        runtime overhead (x) : 3x ~ 18.95 / 6.81
        data loss (%)        : 0
        LOST events          : 0
        SAMPLE events        : 1010769
        perf.data size (GiB) : 1.2

N : 128
        P (period, ms)       : 1.5
        runtime overhead (x) : 3x ~ 19.08 / 6.81
        data loss (%)        : 6
        LOST events          : 220
        SAMPLE events        : 1322240
        perf.data size (GiB) : 1.5

N : 64
        P (period, ms)       : 1.5
        runtime overhead (x) : 3x ~ 19.43 / 6.81
        data loss (%)        : 3
        LOST events          : 130
        SAMPLE events        : 1386521
        perf.data size (GiB) : 1.6

=================================================

Command:
/usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf record \
	-a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
        -e cpu/period=P,event=0x3c/Duk,\
           cpu/period=P,umask=0x3/Duk,\
           cpu/period=P,event=0xc0/Duk,\
           cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
           cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
           cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
         --clockid=monotonic_raw -- ./matrix.(icc|gcc)

Workload: matrix multiplication in 256 threads

        P (period, ms)       : 7.5
        runtime overhead (x) : 1.6x ~ 11.6 / 7.12
        data loss (%)        : 1
        LOST events          : 1
        SAMPLE events        : 451062
        perf.data size (GiB) : 0.5

Workload: matrix multiplication in 128 threads

        P (period, ms)       : 3
        runtime overhead (x) : 1.8x ~ 12.58 / 6.81
        data loss (%)        : 9
        LOST events          : 147
        SAMPLE events        : 673299
        perf.data size (GiB) : 0.8

Thanks,
Alexey

> 
> Thanks,
> Alexey
> 
>>
>>>
>>> thanks,
>>> jirka
>>>
>>
> 
