Message-ID: <9de53d57-b1be-2186-c90b-dbeda21ed717@intel.com>
Date: Fri, 9 Nov 2018 08:19:54 +0800
From: Rong Chen <rong.a.chen@...el.com>
To: Daniel Borkmann <daniel@...earbox.net>,
Joe Stringer <joe@...d.net.nz>
Cc: Alexei Starovoitov <ast@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>, lkp@...org
Subject: Re: [LKP] [bpf] fd978bf7fd: will-it-scale.per_process_ops -4.0%
regression
On 11/02/2018 04:36 PM, Daniel Borkmann wrote:
> Hi Rong,
>
> On 11/02/2018 03:14 AM, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -4.0% regression of will-it-scale.per_process_ops due to commit:
>>
>>
>> commit: fd978bf7fd312581a7ca454a991f0ffb34c4204b ("bpf: Add reference tracking to verifier")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: will-it-scale
>> on test machine: 80 threads Skylake with 64G memory
>> with following parameters:
>>
>> nr_task: 100%
>> mode: process
>> test: mmap1
>> cpufreq_governor: performance
> Hmm, so the test cases you are running are these ones:
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/mmap1.c
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/mmap2.c
>
> The commit from Joe referenced above only adds a feature to the (eBPF) verifier. Looking
> through will-it-scale test suite, looks like there's neither cBPF nor eBPF in use and if
> it would have been the former (e.g. via seccomp BPF), then also this has no effect on it
> since this doesn't load through bpf(2); meaning if so then something must use eBPF here,
> but then it's also unclear right now how this would even remotely affect mmap() test
> performance by -4%. Hm, are you certain it's not a false bisection? If so, what else is
> loading eBPF on your machine in parallel when you run the tests?
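For context on the distinction drawn above: a seccomp filter is installed via
prctl(2)/seccomp(2) and never passes through bpf(2), while eBPF programs are
loaded with bpf(BPF_PROG_LOAD, ...), which is the only path that runs the
verifier this commit touches. A minimal illustrative sketch of the two load
paths -- not taken from any tree, and the helper names are made up:

#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Classic BPF via seccomp: prctl()/seccomp(2), no bpf(2), no eBPF verifier. */
static int install_seccomp_filter(struct sock_filter *insns, unsigned short len)
{
	struct sock_fprog prog = { .len = len, .filter = insns };

	/* caller must have set PR_SET_NO_NEW_PRIVS or hold CAP_SYS_ADMIN */
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* eBPF: bpf(BPF_PROG_LOAD, ...) is where the verifier actually runs. */
static int load_ebpf_prog(enum bpf_prog_type type,
			  const struct bpf_insn *insns, unsigned int cnt)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = type;
	attr.insns     = (unsigned long)insns;
	attr.insn_cnt  = cnt;
	attr.license   = (unsigned long)"GPL";

	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
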
Please accept my apologies for taking your time; it's a false bisection.
Something strange happened, and we're trying to figure out the root cause.
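
For completeness, the mmap1 testcase itself is just a tight anonymous
mmap/munmap loop with no BPF involvement anywhere -- roughly the following,
simplified from the will-it-scale source linked above (the 128MB size and
exact function signature are assumptions, not copied verbatim):

#include <assert.h>
#include <sys/mman.h>

#define MEMSIZE (128UL * 1024 * 1024)	/* assumed region size */

static void testcase(unsigned long long *iterations)
{
	while (1) {
		/* map and immediately unmap an anonymous region */
		void *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);
		munmap(c, MEMSIZE);

		(*iterations)++;
	}
}

which is consistent with the profile below being dominated by
percpu_counter_add_batch() under __vm_enough_memory() and do_munmap().
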
Best Regards,
Rong Chen
>
> Thanks,
> Daniel
>
>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>> test-url: https://github.com/antonblanchard/will-it-scale
>>
>> In addition to that, the commit also has significant impact on the following tests:
>>
>> +------------------+---------------------------------------------------------------+
>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.8% regression |
>> | test machine | 80 threads Skylake with 64G memory |
>> | test parameters | cpufreq_governor=performance |
>> | | mode=process |
>> | | nr_task=100% |
>> | | test=mmap2 |
>> +------------------+---------------------------------------------------------------+
>>
>>
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>>
>>
>> To reproduce:
>>
>> git clone https://github.com/intel/lkp-tests.git
>> cd lkp-tests
>> bin/lkp install job.yaml # job file is attached in this email
>> bin/lkp run job.yaml
>>
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>> gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-2sp2/mmap1/will-it-scale
>>
>> commit:
>> 84dbf35073 ("bpf: Macrofy stack state copy")
>> fd978bf7fd ("bpf: Add reference tracking to verifier")
>>
>> 84dbf3507349696b fd978bf7fd312581a7ca454a99
>> ---------------- --------------------------
>> %stddev %change %stddev
>> \ | \
>> 16811 -4.0% 16140 will-it-scale.per_process_ops
>> 1344946 -4.0% 1291230 will-it-scale.workload
>> 107.75 ± 38% +252.4% 379.75 ± 93% cpuidle.POLL.usage
>> 121.70 ± 18% +18.9% 144.70 ± 4% sched_debug.cfs_rq:/.exec_clock.stddev
>> 4933 +2.0% 5031 proc-vmstat.nr_inactive_anon
>> 4933 +2.0% 5031 proc-vmstat.nr_zone_inactive_anon
>> 9874 +9.0% 10765 ± 7% slabinfo.proc_inode_cache.active_objs
>> 9874 +9.0% 10765 ± 7% slabinfo.proc_inode_cache.num_objs
>> 12248 ± 50% +52.2% 18640 ± 2% numa-meminfo.node0.Inactive
>> 33943 ± 8% +16.2% 39453 ± 6% numa-meminfo.node0.SReclaimable
>> 91549 ± 9% -9.9% 82530 ± 7% numa-meminfo.node1.Slab
>> 18091 ± 15% +29.2% 23382 ± 17% numa-vmstat.node0
>> 3027 ± 52% +52.6% 4620 ± 3% numa-vmstat.node0.nr_inactive_anon
>> 8485 ± 8% +16.2% 9862 ± 6% numa-vmstat.node0.nr_slab_reclaimable
>> 3027 ± 52% +52.6% 4620 ± 3% numa-vmstat.node0.nr_zone_inactive_anon
>> 1.4e+12 -2.5% 1.364e+12 perf-stat.branch-instructions
>> 41.42 +0.7 42.15 perf-stat.cache-miss-rate%
>> 2.166e+10 -2.1% 2.12e+10 perf-stat.cache-references
>> 12.16 +2.7% 12.49 perf-stat.cpi
>> 1.741e+12 -2.8% 1.692e+12 perf-stat.dTLB-loads
>> 0.00 ± 3% +0.0 0.00 ± 9% perf-stat.dTLB-store-miss-rate%
>> 5.713e+11 -3.9% 5.49e+11 perf-stat.dTLB-stores
>> 6.103e+12 -2.6% 5.943e+12 perf-stat.instructions
>> 0.08 -2.6% 0.08 perf-stat.ipc
>> 1.954e+09 -1.8% 1.919e+09 perf-stat.node-load-misses
>> 4538060 +1.4% 4602862 perf-stat.path-length
>> 49.62 -0.5 49.14 perf-profile.calltrace.cycles-pp.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
>> 47.64 -0.5 47.17 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap
>> 47.49 -0.5 47.02 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap
>> 49.99 -0.5 49.53 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
>> 49.96 -0.5 49.51 perf-profile.calltrace.cycles-pp.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
>> 48.02 -0.4 47.58 perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
>> 1.41 -0.0 1.37 perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
>> 47.73 +0.4 48.11 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region
>> 47.85 +0.4 48.25 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap
>> 48.28 +0.4 48.68 perf-profile.calltrace.cycles-pp.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff
>> 48.23 +0.4 48.63 perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff
>> 48.96 +0.4 49.41 perf-profile.calltrace.cycles-pp.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
>> 49.11 +0.5 49.56 perf-profile.calltrace.cycles-pp.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>> 49.24 +0.5 49.70 perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>> 49.25 +0.5 49.72 perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>> 49.62 -0.5 49.15 perf-profile.children.cycles-pp.do_munmap
>> 49.99 -0.5 49.53 perf-profile.children.cycles-pp.__x64_sys_munmap
>> 49.97 -0.5 49.51 perf-profile.children.cycles-pp.vm_munmap
>> 0.51 ± 2% -0.0 0.46 perf-profile.children.cycles-pp.___might_sleep
>> 1.16 -0.0 1.11 perf-profile.children.cycles-pp.unmap_vmas
>> 1.15 -0.0 1.10 perf-profile.children.cycles-pp.unmap_page_range
>> 1.41 -0.0 1.37 perf-profile.children.cycles-pp.unmap_region
>> 0.32 ± 2% +0.0 0.34 ± 2% perf-profile.children.cycles-pp.up_write
>> 0.32 ± 2% +0.0 0.34 perf-profile.children.cycles-pp.vm_area_alloc
>> 0.29 +0.0 0.32 perf-profile.children.cycles-pp.kmem_cache_alloc
>> 48.28 +0.4 48.68 perf-profile.children.cycles-pp.__vm_enough_memory
>> 48.96 +0.4 49.41 perf-profile.children.cycles-pp.mmap_region
>> 49.11 +0.5 49.56 perf-profile.children.cycles-pp.do_mmap
>> 49.25 +0.5 49.71 perf-profile.children.cycles-pp.vm_mmap_pgoff
>> 49.25 +0.5 49.72 perf-profile.children.cycles-pp.ksys_mmap_pgoff
>> 0.47 ± 3% -0.0 0.43 perf-profile.self.cycles-pp.___might_sleep
>> 0.32 ± 3% +0.0 0.34 ± 2% perf-profile.self.cycles-pp.up_write
>> 0.27 +0.0 0.30 perf-profile.self.cycles-pp.kmem_cache_alloc
>> 0.49 +0.0 0.53 perf-profile.self.cycles-pp.percpu_counter_add_batch
>>
>>
>>
>> will-it-scale.per_process_ops
>>
>> 18000 +-+-----------------------------------------------------------------+
>> | |
>> 17500 +-+ +.+ |
>> |+.+++ : +.++++.+++ ++++.++++.++ |
>> | :++. + : : : : |
>> 17000 +-+ + + ++.++: : ++.+++ : ++.+ ++. +. |
>> | + +.+ + + +++ +|
>> 16500 +-+ |
>> | O OOOO OOOO O O |
>> 16000 +-+ O O O O O |
>> | |
>> O O OOO O |
>> 15500 +O+OOO O |
>> | |
>> 15000 +-+-----------------------------------------------------------------+
>>
>>
>> will-it-scale.workload
>>
>> 1.42e+06 +-+--------------------------------------------------------------+
>> 1.4e+06 +-+ ++ |
>> |++.++ : ++. +++.+ |
>> 1.38e+06 +-+ : +.+++ ++ ++++.++ : |
>> 1.36e+06 +-+ +.+++++. : : : :+ |
>> | ++++ ++.+++++.+ + ++.+++++.++|
>> 1.34e+06 +-+ |
>> 1.32e+06 +-+ |
>> 1.3e+06 +-+ O |
>> | OO OO OO OOOOO OOO |
>> 1.28e+06 +-+ O |
>> 1.26e+06 +-+ O |
>> O O O OO |
>> 1.24e+06 +OO OO O |
>> 1.22e+06 +-+--------------------------------------------------------------+
>>
>>
>> [*] bisect-good sample
>> [O] bisect-bad sample
>>
>> ***************************************************************************************************
>> lkp-skl-2sp2: 80 threads Skylake with 64G memory
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>> gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-2sp2/mmap2/will-it-scale
>>
>> commit:
>> 84dbf35073 ("bpf: Macrofy stack state copy")
>> fd978bf7fd ("bpf: Add reference tracking to verifier")
>>
>> 84dbf3507349696b fd978bf7fd312581a7ca454a99
>> ---------------- --------------------------
>> %stddev %change %stddev
>> \ | \
>> 16832 -3.8% 16186 will-it-scale.per_process_ops
>> 1346634 -3.8% 1294984 will-it-scale.workload
>> 390809 ± 21% +51.6% 592424 ± 27% cpuidle.C1.time
>> 6897 +2.7% 7085 proc-vmstat.nr_mapped
>> 936.00 ± 7% +15.6% 1082 ± 5% slabinfo.Acpi-ParseExt.active_objs
>> 936.00 ± 7% +15.6% 1082 ± 5% slabinfo.Acpi-ParseExt.num_objs
>> 968.00 ± 9% +27.5% 1233 ± 16% slabinfo.pool_workqueue.active_objs
>> 968.00 ± 9% +29.7% 1255 ± 16% slabinfo.pool_workqueue.num_objs
>> 8430 -14.1% 7244 ± 2% numa-meminfo.node0.KernelStack
>> 4283 ± 14% -22.4% 3325 ± 10% numa-meminfo.node0.PageTables
>> 73929 ± 3% -10.6% 66061 ± 6% numa-meminfo.node0.SUnreclaim
>> 5569 ± 2% +21.0% 6738 ± 3% numa-meminfo.node1.KernelStack
>> 55085 ± 5% +17.5% 64739 ± 5% numa-meminfo.node1.SUnreclaim
>> 81155 ± 6% +16.2% 94292 ± 7% numa-meminfo.node1.Slab
>> 230.00 -100.0% 0.00 numa-vmstat.node0.nr_active_file
>> 100.25 ± 3% -88.8% 11.25 ±173% numa-vmstat.node0.nr_inactive_file
>> 8431 -14.1% 7245 ± 2% numa-vmstat.node0.nr_kernel_stack
>> 1071 ± 14% -22.4% 831.25 ± 10% numa-vmstat.node0.nr_page_table_pages
>> 18482 ± 3% -10.6% 16515 ± 6% numa-vmstat.node0.nr_slab_unreclaimable
>> 230.00 -100.0% 0.00 numa-vmstat.node0.nr_zone_active_file
>> 100.25 ± 3% -88.8% 11.25 ±173% numa-vmstat.node0.nr_zone_inactive_file
>> 5569 ± 2% +21.0% 6738 ± 3% numa-vmstat.node1.nr_kernel_stack
>> 2778 ± 3% +28.4% 3567 ± 16% numa-vmstat.node1.nr_mapped
>> 13771 ± 5% +17.5% 16184 ± 5% numa-vmstat.node1.nr_slab_unreclaimable
>> 1.506e+12 -2.5% 1.468e+12 perf-stat.branch-instructions
>> 41.41 +0.8 42.20 perf-stat.cache-miss-rate%
>> 2.165e+10 -1.7% 2.129e+10 perf-stat.cache-references
>> 11.25 +2.8% 11.57 perf-stat.cpi
>> 1.891e+12 -2.8% 1.838e+12 perf-stat.dTLB-loads
>> 6.543e+11 -3.7% 6.3e+11 perf-stat.dTLB-stores
>> 6.591e+12 -2.6% 6.419e+12 perf-stat.instructions
>> 0.09 -2.7% 0.09 perf-stat.ipc
>> 1.967e+09 -1.3% 1.941e+09 perf-stat.node-load-misses
>> 4894750 +1.3% 4956596 perf-stat.path-length
>> 40.37 ± 12% -16.2% 33.81 ± 7% sched_debug.cfs_rq:/.load_avg.stddev
>> 0.05 ± 2% +18.7% 0.06 ± 3% sched_debug.cfs_rq:/.nr_running.stddev
>> 6.37 ± 40% -50.2% 3.17 ± 32% sched_debug.cfs_rq:/.removed.load_avg.avg
>> 31.64 ± 18% -28.5% 22.63 ± 16% sched_debug.cfs_rq:/.removed.load_avg.stddev
>> 293.89 ± 40% -50.1% 146.61 ± 32% sched_debug.cfs_rq:/.removed.runnable_sum.avg
>> 1459 ± 18% -28.3% 1045 ± 16% sched_debug.cfs_rq:/.removed.runnable_sum.stddev
>> 2.46 ± 43% -60.9% 0.96 ± 66% sched_debug.cfs_rq:/.removed.util_avg.avg
>> 12.42 ± 26% -46.5% 6.64 ± 59% sched_debug.cfs_rq:/.removed.util_avg.stddev
>> 385.92 ± 6% +12.8% 435.46 ± 2% sched_debug.cpu.nr_switches.min
>> -14.21 -31.4% -9.75 sched_debug.cpu.nr_uninterruptible.min
>> 47.54 -0.2 47.31 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap
>> 47.67 -0.2 47.45 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap
>> 48.04 -0.2 47.86 perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
>> 99.36 -0.0 99.34 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
>> 1.47 +0.0 1.51 perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
>> 94.77 -0.3 94.52 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>> 95.04 -0.2 94.81 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
>> 95.77 -0.2 95.60 perf-profile.children.cycles-pp.percpu_counter_add_batch
>> 49.72 -0.1 49.58 perf-profile.children.cycles-pp.do_munmap
>> 0.53 ± 2% -0.1 0.47 perf-profile.children.cycles-pp.___might_sleep
>> 0.30 ± 2% +0.0 0.33 perf-profile.children.cycles-pp.perf_event_mmap
>> 0.30 ± 3% +0.0 0.33 ± 2% perf-profile.children.cycles-pp.vm_area_alloc
>> 0.33 ± 2% +0.0 0.36 ± 2% perf-profile.children.cycles-pp.up_write
>> 1.48 +0.0 1.51 perf-profile.children.cycles-pp.unmap_region
>> 94.77 -0.3 94.52 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
>> 0.48 ± 2% -0.0 0.44 perf-profile.self.cycles-pp.___might_sleep
>> 0.33 ± 2% +0.0 0.36 ± 2% perf-profile.self.cycles-pp.up_write
>> 0.53 +0.0 0.57 perf-profile.self.cycles-pp.unmap_page_range
>> 0.47 +0.0 0.52 ± 2% perf-profile.self.cycles-pp.percpu_counter_add_batch
>>
>>
>>
>>
>>
>> Disclaimer:
>> Results have been estimated based on internal Intel analysis and are provided
>> for informational purposes only. Any difference in system hardware or software
>> design or configuration may affect actual performance.
>>
>>
>> Thanks,
>> Rong Chen
>>