Message-ID: <202503312148.c74b0351-lkp@intel.com>
Date: Mon, 31 Mar 2025 22:10:13 +0800
From: kernel test robot <oliver.sang@...el.com>
To: Nikhil Dhama <nikhil.dhama@....com>
CC: <oe-lkp@...ts.linux.dev>, <lkp@...el.com>, Andrew Morton
<akpm@...ux-foundation.org>, Ying Huang <huang.ying.caritas@...il.com>,
Bharata B Rao <bharata@....com>, Raghavendra
<raghavendra.kodsarathimmappa@....com>, <linux-mm@...ck.org>,
<ying.huang@...ux.alibaba.com>, Nikhil Dhama <nikhil.dhama@....com>,
<linux-kernel@...r.kernel.org>, <oliver.sang@...el.com>
Subject: Re: [PATCH] mm: pcp: scale batch to reduce number of high order pcp
flushes on deallocation
Hello,
kernel test robot noticed a 32.2% improvement of lmbench3.TCP.socket.bandwidth.10MB.MB/sec on:
commit: 6570c41610d1d2d3b143c253010b38ce9cbc0012 ("[PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation")
url: https://github.com/intel-lab-lkp/linux/commits/Nikhil-Dhama/mm-pcp-scale-batch-to-reduce-number-of-high-order-pcp-flushes-on-deallocation/20250326-012247
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20250325171915.14384-1-nikhil.dhama@amd.com/
patch subject: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
testcase: lmbench3
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:
test_memory_size: 50%
nr_threads: 100%
mode: development
test: TCP
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250331/202503312148.c74b0351-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_threads/rootfs/tbox_group/test/test_memory_size/testcase:
gcc-12/performance/x86_64-rhel-9.4/development/100%/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/TCP/50%/lmbench3
commit:
7514d3cb91 ("foo")
6570c41610 ("mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation")
7514d3cb916f9344   6570c41610d1d2d3b143c253010
----------------   ---------------------------
         %stddev       %change         %stddev
             \             |               \
143.28 ± 38% +49.0% 213.49 ± 20% numa-vmstat.node1.nr_anon_transparent_hugepages
118.00 ± 21% +50.3% 177.33 ± 17% perf-c2c.DRAM.local
182485 +32.2% 241267 lmbench3.TCP.socket.bandwidth.10MB.MB/sec
40582104 ± 6% +114.4% 87026622 ± 2% lmbench3.time.involuntary_context_switches
0.46 ± 2% +0.1 0.52 ± 3% mpstat.cpu.all.irq%
4.57 ± 11% +1.4 5.96 ± 6% mpstat.cpu.all.soft%
291657 ± 38% +49.6% 436355 ± 20% numa-meminfo.node1.AnonHugePages
4728254 ± 36% +32.0% 6241931 ± 26% numa-meminfo.node1.MemUsed
0.40 -24.4% 0.30 ± 12% perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
13.88 ± 3% -78.2% 3.03 ±157% perf-sched.wait_time.max.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.50 ± 4% +670.3% 11.58 ± 38% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
1.209e+09 ± 3% +6.5% 1.288e+09 proc-vmstat.numa_hit
1.209e+09 ± 3% +6.5% 1.287e+09 proc-vmstat.numa_local
9.644e+09 ± 3% +6.6% 1.028e+10 proc-vmstat.pgalloc_normal
9.644e+09 ± 3% +6.6% 1.028e+10 proc-vmstat.pgfree
92870937 ± 14% -17.9% 76271910 ± 8% sched_debug.cfs_rq:/.avg_vruntime.avg
2343 ± 10% -17.3% 1938 ± 17% sched_debug.cfs_rq:/.load.min
92870938 ± 14% -17.9% 76271910 ± 8% sched_debug.cfs_rq:/.min_vruntime.avg
13803 ± 10% -22.2% 10740 ± 14% sched_debug.cpu.curr->pid.min
2.87 ± 9% +69.1% 4.85 ± 4% perf-stat.i.MPKI
0.31 ± 6% +0.0 0.34 ± 3% perf-stat.i.branch-miss-rate%
13.92 +1.1 15.06 perf-stat.i.cache-miss-rate%
2.719e+08 ± 9% +27.6% 3.469e+08 ± 4% perf-stat.i.cache-misses
5.658e+11 -2.5% 5.516e+11 perf-stat.i.cpu-cycles
3.618e+11 ± 7% +10.5% 3.996e+11 ± 4% perf-stat.i.instructions
1.64 ± 9% -42.0% 0.95 ± 70% perf-stat.overall.cpi
2233 ± 11% -50.7% 1100 ± 71% perf-stat.overall.cycles-between-cache-misses
5.691e+11 -35.0% 3.702e+11 ± 70% perf-stat.ps.cpu-cycles
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki