Message-ID: <202503312148.c74b0351-lkp@intel.com>
Date: Mon, 31 Mar 2025 22:10:13 +0800
From: kernel test robot <oliver.sang@...el.com>
To: Nikhil Dhama <nikhil.dhama@....com>
CC: <oe-lkp@...ts.linux.dev>, <lkp@...el.com>, Andrew Morton
<akpm@...ux-foundation.org>, Ying Huang <huang.ying.caritas@...il.com>,
Bharata B Rao <bharata@....com>, Raghavendra
<raghavendra.kodsarathimmappa@....com>, <linux-mm@...ck.org>,
<ying.huang@...ux.alibaba.com>, Nikhil Dhama <nikhil.dhama@....com>,
<linux-kernel@...r.kernel.org>, <oliver.sang@...el.com>
Subject: Re: [PATCH] mm: pcp: scale batch to reduce number of high order pcp
flushes on deallocation
Hello,
kernel test robot noticed a 32.2% improvement of lmbench3.TCP.socket.bandwidth.10MB.MB/sec on:
commit: 6570c41610d1d2d3b143c253010b38ce9cbc0012 ("[PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation")
url: https://github.com/intel-lab-lkp/linux/commits/Nikhil-Dhama/mm-pcp-scale-batch-to-reduce-number-of-high-order-pcp-flushes-on-deallocation/20250326-012247
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20250325171915.14384-1-nikhil.dhama@amd.com/
patch subject: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
testcase: lmbench3
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:
test_memory_size: 50%
nr_threads: 100%
mode: development
test: TCP
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250331/202503312148.c74b0351-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_threads/rootfs/tbox_group/test/test_memory_size/testcase:
gcc-12/performance/x86_64-rhel-9.4/development/100%/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/TCP/50%/lmbench3
commit:
7514d3cb91 ("foo")
6570c41610 ("mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation")
7514d3cb916f9344   6570c41610d1d2d3b143c253010
----------------   ---------------------------
         %stddev       %change         %stddev
             \             |               \
143.28 ± 38% +49.0% 213.49 ± 20% numa-vmstat.node1.nr_anon_transparent_hugepages
118.00 ± 21% +50.3% 177.33 ± 17% perf-c2c.DRAM.local
182485 +32.2% 241267 lmbench3.TCP.socket.bandwidth.10MB.MB/sec
40582104 ± 6% +114.4% 87026622 ± 2% lmbench3.time.involuntary_context_switches
0.46 ± 2% +0.1 0.52 ± 3% mpstat.cpu.all.irq%
4.57 ± 11% +1.4 5.96 ± 6% mpstat.cpu.all.soft%
291657 ± 38% +49.6% 436355 ± 20% numa-meminfo.node1.AnonHugePages
4728254 ± 36% +32.0% 6241931 ± 26% numa-meminfo.node1.MemUsed
0.40 -24.4% 0.30 ± 12% perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
13.88 ± 3% -78.2% 3.03 ±157% perf-sched.wait_time.max.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.50 ± 4% +670.3% 11.58 ± 38% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
1.209e+09 ± 3% +6.5% 1.288e+09 proc-vmstat.numa_hit
1.209e+09 ± 3% +6.5% 1.287e+09 proc-vmstat.numa_local
9.644e+09 ± 3% +6.6% 1.028e+10 proc-vmstat.pgalloc_normal
9.644e+09 ± 3% +6.6% 1.028e+10 proc-vmstat.pgfree
92870937 ± 14% -17.9% 76271910 ± 8% sched_debug.cfs_rq:/.avg_vruntime.avg
2343 ± 10% -17.3% 1938 ± 17% sched_debug.cfs_rq:/.load.min
92870938 ± 14% -17.9% 76271910 ± 8% sched_debug.cfs_rq:/.min_vruntime.avg
13803 ± 10% -22.2% 10740 ± 14% sched_debug.cpu.curr->pid.min
2.87 ± 9% +69.1% 4.85 ± 4% perf-stat.i.MPKI
0.31 ± 6% +0.0 0.34 ± 3% perf-stat.i.branch-miss-rate%
13.92 +1.1 15.06 perf-stat.i.cache-miss-rate%
2.719e+08 ± 9% +27.6% 3.469e+08 ± 4% perf-stat.i.cache-misses
5.658e+11 -2.5% 5.516e+11 perf-stat.i.cpu-cycles
3.618e+11 ± 7% +10.5% 3.996e+11 ± 4% perf-stat.i.instructions
1.64 ± 9% -42.0% 0.95 ± 70% perf-stat.overall.cpi
2233 ± 11% -50.7% 1100 ± 71% perf-stat.overall.cycles-between-cache-misses
5.691e+11 -35.0% 3.702e+11 ± 70% perf-stat.ps.cpu-cycles
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki