Message-ID: <a939565a-cdab-4d8b-938e-38e3d837d653@suse.cz>
Date: Thu, 25 Jul 2024 12:11:45 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: kernel test robot <oliver.sang@...el.com>,
 Hyunmin Lee <hyunminlr@...il.com>
Cc: oe-lkp@...ts.linux.dev, lkp@...el.com, linux-kernel@...r.kernel.org,
 Jeungwoo Yoo <casionwoo@...il.com>, Sangyun Kim <sangyun.kim@....ac.kr>,
 Hyeonggon Yoo <42.hyeyoo@...il.com>,
 Gwan-gyeong Mun <gwan-gyeong.mun@...el.com>, Christoph Lameter
 <cl@...ux.com>, David Rientjes <rientjes@...gle.com>, linux-mm@...ck.org,
 ying.huang@...el.com, feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linus:master] [mm/slub] 306c4ac989: stress-ng.seal.ops_per_sec
 5.2% improvement

On 7/25/24 10:04 AM, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed a 5.2% improvement of stress-ng.seal.ops_per_sec on:
> 
> 
> commit: 306c4ac9896b07b8872293eb224058ff83f81fac ("mm/slub: create kmalloc 96 and 192 caches regardless cache size order")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

Well, that's great news, but it's also highly unlikely that the commit itself
would cause such an improvement, as it only optimizes create_kmalloc_caches(),
a once-per-boot operation. Maybe there are secondary effects: a different
order of slab cache creation resulting in a different cpu cache layout. But
such an improvement would be machine- and compiler-specific and overall fragile.

> testcase: stress-ng
> test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
> parameters:
> 
> 	nr_threads: 100%
> 	testtime: 60s
> 	test: seal
> 	cpufreq_governor: performance
> 
> 
> 
> 
> 
> 
> Details are as below:
> -------------------------------------------------------------------------------------------------->
> 
> 
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240725/202407251553.12f35198-oliver.sang@intel.com
> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
>   gcc-13/performance/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-spr-r02/seal/stress-ng/60s
> 
> commit: 
>   844776cb65 ("mm/slub: mark racy access on slab->freelist")
>   306c4ac989 ("mm/slub: create kmalloc 96 and 192 caches regardless cache size order")
> 
> 844776cb65a77ef2 306c4ac9896b07b8872293eb224 
> ---------------- --------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>       2.51 ± 27%      +1.9        4.44 ± 35%  mpstat.cpu.all.idle%
>     975100 ± 19%     +29.5%    1262643 ± 16%  numa-meminfo.node1.AnonPages.max
>     187.06 ±  4%     -11.5%     165.63 ± 10%  sched_debug.cfs_rq:/.runnable_avg.stddev
>       0.05 ± 18%     -40.0%       0.03 ± 58%  vmstat.procs.b
>   58973718            +5.2%   62024061        stress-ng.seal.ops
>     982893            +5.2%    1033732        stress-ng.seal.ops_per_sec
>   59045344            +5.2%   62095668        stress-ng.time.minor_page_faults
>     174957            +1.4%     177400        proc-vmstat.nr_slab_unreclaimable
>   63634761            +5.5%   67148443        proc-vmstat.numa_hit
>   63399995            +5.5%   66914221        proc-vmstat.numa_local
>   73601172            +6.1%   78073549        proc-vmstat.pgalloc_normal
>   59870250            +5.3%   63063514        proc-vmstat.pgfault
>   72718474            +6.0%   77106313        proc-vmstat.pgfree
>  1.983e+10            +1.3%   2.01e+10        perf-stat.i.branch-instructions
>   66023349            +5.6%   69728143        perf-stat.i.cache-misses
>  2.023e+08            +4.7%  2.117e+08        perf-stat.i.cache-references
>       7.22            -1.9%       7.08        perf-stat.i.cpi
>       9738            -5.6%       9196        perf-stat.i.cycles-between-cache-misses
>  8.799e+10            +1.6%  8.939e+10        perf-stat.i.instructions
>       0.14            +1.6%       0.14        perf-stat.i.ipc
>       8.71            +5.1%       9.16        perf-stat.i.metric.K/sec
>     983533            +4.7%    1029816        perf-stat.i.minor-faults
>     983533            +4.7%    1029816        perf-stat.i.page-faults
>       7.30           -18.4%       5.96 ± 44%  perf-stat.overall.cpi
>       9735           -21.3%       7658 ± 44%  perf-stat.overall.cycles-between-cache-misses
>       0.52            +0.1        0.62 ±  7%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.ftruncate64
>       0.56            +0.1        0.67 ±  7%  perf-profile.calltrace.cycles-pp.ftruncate64
>       0.34 ± 70%      +0.3        0.60 ±  7%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
>      48.29            +0.6       48.86        perf-profile.calltrace.cycles-pp.__close
>      48.27            +0.6       48.84        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
>      48.27            +0.6       48.84        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__close
>      48.26            +0.6       48.83        perf-profile.calltrace.cycles-pp.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
>       0.00            +0.6        0.58 ±  7%  perf-profile.calltrace.cycles-pp.__x64_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
>      48.21            +0.6       48.80        perf-profile.calltrace.cycles-pp.__fput.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
>      48.03            +0.6       48.68        perf-profile.calltrace.cycles-pp.dput.__fput.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      48.02            +0.6       48.66        perf-profile.calltrace.cycles-pp.__dentry_kill.dput.__fput.__x64_sys_close.do_syscall_64
>      47.76            +0.7       48.47        perf-profile.calltrace.cycles-pp.evict.__dentry_kill.dput.__fput.__x64_sys_close
>      47.19            +0.7       47.92        perf-profile.calltrace.cycles-pp._raw_spin_lock.evict.__dentry_kill.dput.__fput
>      47.11            +0.8       47.88        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.evict.__dentry_kill.dput
>       0.74            -0.3        0.48 ±  8%  perf-profile.children.cycles-pp.__munmap
>       0.69            -0.2        0.44 ±  9%  perf-profile.children.cycles-pp.__x64_sys_munmap
>       0.68            -0.2        0.44 ±  9%  perf-profile.children.cycles-pp.__vm_munmap
>       0.68            -0.2        0.45 ±  9%  perf-profile.children.cycles-pp.do_vmi_munmap
>       0.65            -0.2        0.42 ±  8%  perf-profile.children.cycles-pp.do_vmi_align_munmap
>       0.44            -0.2        0.28 ±  7%  perf-profile.children.cycles-pp.unmap_region
>       0.48            -0.1        0.36 ±  7%  perf-profile.children.cycles-pp.asm_exc_page_fault
>       0.42            -0.1        0.32 ±  7%  perf-profile.children.cycles-pp.do_user_addr_fault
>       0.42 ±  2%      -0.1        0.32 ±  7%  perf-profile.children.cycles-pp.exc_page_fault
>       0.38 ±  2%      -0.1        0.29 ±  7%  perf-profile.children.cycles-pp.handle_mm_fault
>       0.35 ±  2%      -0.1        0.27 ±  7%  perf-profile.children.cycles-pp.__handle_mm_fault
>       0.33 ±  2%      -0.1        0.26 ±  6%  perf-profile.children.cycles-pp.do_fault
>       0.21 ±  2%      -0.1        0.14 ±  8%  perf-profile.children.cycles-pp.lru_add_drain
>       0.22            -0.1        0.15 ± 11%  perf-profile.children.cycles-pp.alloc_inode
>       0.21 ±  2%      -0.1        0.15 ±  9%  perf-profile.children.cycles-pp.lru_add_drain_cpu
>       0.18 ±  2%      -0.1        0.12 ±  8%  perf-profile.children.cycles-pp.unmap_vmas
>       0.21 ±  2%      -0.1        0.14 ±  7%  perf-profile.children.cycles-pp.folio_batch_move_lru
>       0.17            -0.1        0.11 ±  8%  perf-profile.children.cycles-pp.unmap_page_range
>       0.16 ±  2%      -0.1        0.10 ±  9%  perf-profile.children.cycles-pp.zap_pte_range
>       0.16 ±  2%      -0.1        0.10 ±  9%  perf-profile.children.cycles-pp.zap_pmd_range
>       0.26 ±  2%      -0.1        0.20 ±  7%  perf-profile.children.cycles-pp.shmem_fault
>       0.50            -0.1        0.45 ±  8%  perf-profile.children.cycles-pp.mmap_region
>       0.26 ±  2%      -0.1        0.20 ±  7%  perf-profile.children.cycles-pp.__do_fault
>       0.26            -0.1        0.21 ±  6%  perf-profile.children.cycles-pp.shmem_get_folio_gfp
>       0.19 ±  2%      -0.1        0.14 ± 14%  perf-profile.children.cycles-pp.write
>       0.22 ±  3%      -0.0        0.18 ±  5%  perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
>       0.11 ±  4%      -0.0        0.07 ± 10%  perf-profile.children.cycles-pp.mas_store_gfp
>       0.16 ±  2%      -0.0        0.12 ± 11%  perf-profile.children.cycles-pp.mas_wr_store_entry
>       0.14            -0.0        0.10 ± 10%  perf-profile.children.cycles-pp.mas_wr_node_store
>       0.08            -0.0        0.04 ± 45%  perf-profile.children.cycles-pp.msync
>       0.06            -0.0        0.02 ± 99%  perf-profile.children.cycles-pp.mas_find
>       0.12 ±  4%      -0.0        0.08 ± 11%  perf-profile.children.cycles-pp.inode_init_always
>       0.10 ±  3%      -0.0        0.07 ± 11%  perf-profile.children.cycles-pp.shmem_alloc_inode
>       0.16            -0.0        0.13 ±  9%  perf-profile.children.cycles-pp.__x64_sys_fcntl
>       0.11 ±  4%      -0.0        0.08 ± 11%  perf-profile.children.cycles-pp.shmem_file_write_iter
>       0.10 ±  4%      -0.0        0.08 ±  8%  perf-profile.children.cycles-pp.do_fcntl
>       0.15            -0.0        0.13 ±  8%  perf-profile.children.cycles-pp.destroy_inode
>       0.16 ±  3%      -0.0        0.14 ±  7%  perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
>       0.22 ±  3%      -0.0        0.20 ±  5%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
>       0.08            -0.0        0.06 ± 11%  perf-profile.children.cycles-pp.___slab_alloc
>       0.15 ±  3%      -0.0        0.12 ±  8%  perf-profile.children.cycles-pp.__destroy_inode
>       0.07 ±  7%      -0.0        0.04 ± 45%  perf-profile.children.cycles-pp.__call_rcu_common
>       0.13 ±  2%      -0.0        0.11 ±  8%  perf-profile.children.cycles-pp.perf_event_mmap
>       0.09            -0.0        0.07 ±  9%  perf-profile.children.cycles-pp.memfd_fcntl
>       0.06            -0.0        0.04 ± 44%  perf-profile.children.cycles-pp.native_irq_return_iret
>       0.08 ±  6%      -0.0        0.06 ±  8%  perf-profile.children.cycles-pp.shmem_add_to_page_cache
>       0.12            -0.0        0.10 ±  6%  perf-profile.children.cycles-pp.perf_event_mmap_event
>       0.11 ±  3%      -0.0        0.09 ±  7%  perf-profile.children.cycles-pp.__lruvec_stat_mod_folio
>       0.10            -0.0        0.08 ±  8%  perf-profile.children.cycles-pp.uncharge_batch
>       0.12 ±  4%      -0.0        0.10 ±  6%  perf-profile.children.cycles-pp.entry_SYSCALL_64
>       0.05            +0.0        0.07 ±  5%  perf-profile.children.cycles-pp.__d_alloc
>       0.05            +0.0        0.07 ± 10%  perf-profile.children.cycles-pp.d_alloc_pseudo
>       0.07            +0.0        0.09 ±  7%  perf-profile.children.cycles-pp.file_init_path
>       0.06 ±  6%      +0.0        0.08 ±  8%  perf-profile.children.cycles-pp.security_file_alloc
>       0.07 ±  7%      +0.0        0.09 ±  7%  perf-profile.children.cycles-pp.errseq_sample
>       0.04 ± 44%      +0.0        0.07 ± 10%  perf-profile.children.cycles-pp.apparmor_file_alloc_security
>       0.09            +0.0        0.12 ±  5%  perf-profile.children.cycles-pp.init_file
>       0.15            +0.0        0.18 ±  7%  perf-profile.children.cycles-pp.common_perm_cond
>       0.15 ±  3%      +0.0        0.19 ±  8%  perf-profile.children.cycles-pp.security_file_truncate
>       0.20            +0.0        0.24 ±  7%  perf-profile.children.cycles-pp.notify_change
>       0.06            +0.0        0.10 ±  6%  perf-profile.children.cycles-pp.inode_init_owner
>       0.13            +0.0        0.18 ±  5%  perf-profile.children.cycles-pp.alloc_empty_file
>       0.10            +0.1        0.16 ±  7%  perf-profile.children.cycles-pp.clear_nlink
>       0.47            +0.1        0.56 ±  7%  perf-profile.children.cycles-pp.do_ftruncate
>       0.49            +0.1        0.59 ±  7%  perf-profile.children.cycles-pp.__x64_sys_ftruncate
>       0.59            +0.1        0.70 ±  7%  perf-profile.children.cycles-pp.ftruncate64
>       0.28            +0.1        0.40 ±  6%  perf-profile.children.cycles-pp.alloc_file_pseudo
>      98.62            +0.2       98.77        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
>      98.58            +0.2       98.74        perf-profile.children.cycles-pp.do_syscall_64
>      48.30            +0.6       48.86        perf-profile.children.cycles-pp.__close
>      48.26            +0.6       48.83        perf-profile.children.cycles-pp.__x64_sys_close
>      48.21            +0.6       48.80        perf-profile.children.cycles-pp.__fput
>      48.04            +0.6       48.68        perf-profile.children.cycles-pp.dput
>      48.02            +0.6       48.67        perf-profile.children.cycles-pp.__dentry_kill
>      47.77            +0.7       48.47        perf-profile.children.cycles-pp.evict
>       0.30            -0.1        0.23 ±  7%  perf-profile.self.cycles-pp._raw_spin_lock
>       0.10 ±  4%      -0.0        0.06 ±  7%  perf-profile.self.cycles-pp.__fput
>       0.08 ±  6%      -0.0        0.05 ±  8%  perf-profile.self.cycles-pp.inode_init_always
>       0.06            -0.0        0.04 ± 44%  perf-profile.self.cycles-pp.native_irq_return_iret
>       0.08            -0.0        0.06 ±  7%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
>       0.09            -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
>       0.07            +0.0        0.09 ±  7%  perf-profile.self.cycles-pp.__shmem_get_inode
>       0.06 ±  7%      +0.0        0.09 ±  9%  perf-profile.self.cycles-pp.errseq_sample
>       0.15 ±  2%      +0.0        0.18 ±  7%  perf-profile.self.cycles-pp.common_perm_cond
>       0.03 ± 70%      +0.0        0.06 ±  7%  perf-profile.self.cycles-pp.apparmor_file_alloc_security
>       0.06            +0.0        0.10 ±  7%  perf-profile.self.cycles-pp.inode_init_owner
>       0.10            +0.1        0.16 ±  6%  perf-profile.self.cycles-pp.clear_nlink
> 
> 
> 
> 
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
> 
> 

