Message-ID: <202506261012.11b518e7-lkp@intel.com>
Date: Thu, 26 Jun 2025 10:57:15 +0800
From: kernel test robot <oliver.sang@...el.com>
To: Herbert Xu <herbert@...dor.apana.org.au>
CC: <oe-lkp@...ts.linux.dev>, <lkp@...el.com>, <linux-crypto@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <oliver.sang@...el.com>
Subject: [linux-next:master] [padata]  71203f68c7:  unixbench.throughput 3.1%
 regression



Hello,


Normally we don't report performance results when we suspect they are
caused by alignment effects.

Since this patch touches code related to alignment:

-       struct work_struct              reorder_work;
-       spinlock_t                      ____cacheline_aligned lock;

we are still sending the report below, FYI, to show the possible
performance impact.
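
For readers unfamiliar with the annotation: ____cacheline_aligned places a
field at the start of its own cache line, so removing such a field (or its
annotation) shifts the offsets of everything that follows and changes which
fields share a line. Below is a minimal userspace sketch of that effect,
using a hypothetical struct and assuming 64-byte cache lines; the kernel
derives the real value from L1_CACHE_BYTES.

/*
 * Emulate ____cacheline_aligned with a GCC alignment attribute and
 * compare the resulting layouts.
 */
#include <stdio.h>
#include <stddef.h>

#define CACHELINE 64    /* assumption: 64-byte lines, as on x86_64 */

struct padded {
        long hot_counter;
        /* the aligned field starts on its own cache line */
        int lock __attribute__((aligned(CACHELINE)));
};

struct unpadded {
        long hot_counter;
        int lock;       /* shares a cache line with hot_counter */
};

int main(void)
{
        printf("padded:   sizeof=%zu offsetof(lock)=%zu\n",
               sizeof(struct padded), offsetof(struct padded, lock));
        printf("unpadded: sizeof=%zu offsetof(lock)=%zu\n",
               sizeof(struct unpadded), offsetof(struct unpadded, lock));
        return 0;
}

Whether the resulting sharing helps or hurts depends on the workload's
access pattern, which is why a layout change like this can move benchmark
numbers in either direction.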


kernel test robot noticed a 3.1% regression of unixbench.throughput on:


commit: 71203f68c7749609d7fc8ae6ad054bdedeb24f91 ("padata: Fix pd UAF once and for all")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

[test failed on linux-next/master 1b152eeca84a02bdb648f16b82ef3394007a9dcf]
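
For context on the commit title (an illustrative aside, not taken from the
patch itself): "pd" is padata's parallel_data object, and the use-after-free
class referred to is the common one where an object is freed while deferred
work still holds a pointer to it. A schematic userspace sketch of that bug
and the usual fix pattern, assuming nothing about the actual padata code:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct pd {                     /* stand-in for a parallel-data object */
        int stats;
};

static void *worker(void *arg)
{
        struct pd *pd = arg;
        pd->stats++;            /* use-after-free if pd is freed first */
        return NULL;
}

int main(void)
{
        struct pd *pd = calloc(1, sizeof(*pd));
        pthread_t t;

        if (!pd)
                return 1;
        pthread_create(&t, NULL, worker, pd);

        /*
         * Buggy pattern: calling free(pd) here races with worker().
         * Fix pattern: make the object's lifetime cover all deferred
         * work -- wait for (or flush) it, then free.
         */
        pthread_join(t, NULL);
        printf("stats=%d\n", pd->stats);
        free(pd);
        return 0;
}

In the kernel the fix usually means flushing or completing queued work (or
holding a reference) before teardown; the removal of the reorder_work field
in the hunk above suggests this commit instead eliminates one such
deferred-work path entirely.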

testcase: unixbench
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	runtime: 300s
	nr_task: 100%
	test: fsbuffer-w
	cpufreq_governor: performance


In addition, the commit also has a significant impact on the following test:

+------------------+--------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops 1.1% improvement |
| test machine     | 104 threads 2 sockets (Skylake) with 192G memory             |
| test parameters  | cpufreq_governor=performance                                 |
|                  | mode=thread                                                  |
|                  | nr_task=100%                                                 |
|                  | test=pwrite1                                                 |
+------------------+--------------------------------------------------------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version
of the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@...el.com>
| Closes: https://lore.kernel.org/oe-lkp/202506261012.11b518e7-lkp@intel.com


Details are as follows:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250626/202506261012.11b518e7-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/300s/lkp-icl-2sp9/fsbuffer-w/unixbench

commit: 
  73c2437109 ("crypto: s390/sha3 - Use cpu byte-order when exporting")
  71203f68c7 ("padata: Fix pd UAF once and for all")

73c2437109c3eab2 71203f68c7749609d7fc8ae6ad0 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    111306            +2.0%     113530        proc-vmstat.pgreuse
      0.01 ±  4%     +14.9%       0.01        perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
    715.14 ±167%     -99.4%       3.99 ± 80%  perf-sched.total_sch_delay.max.ms
  33278354            -3.1%   32249523        unixbench.throughput
      1854            -4.8%       1765        unixbench.time.user_time
 1.233e+10            -3.1%  1.195e+10        unixbench.workload
 4.717e+10            -3.1%  4.573e+10        perf-stat.i.branch-instructions
      0.42            -0.0        0.41        perf-stat.i.branch-miss-rate%
  28489209 ±  2%     -10.9%   25397034        perf-stat.i.branch-misses
      0.97            +2.8%       1.00        perf-stat.i.cpi
 1.946e+11            -3.1%  1.886e+11        perf-stat.i.instructions
      1.05            -2.8%       1.02        perf-stat.i.ipc
      0.06 ±  2%      -0.0        0.06        perf-stat.overall.branch-miss-rate%
      0.94            +3.2%       0.97        perf-stat.overall.cpi
      1.06            -3.1%       1.03        perf-stat.overall.ipc
 4.706e+10            -3.1%  4.562e+10        perf-stat.ps.branch-instructions
  28421825 ±  2%     -10.9%   25336865        perf-stat.ps.branch-misses
 1.942e+11            -3.1%  1.882e+11        perf-stat.ps.instructions
 7.212e+13            -3.1%  6.991e+13        perf-stat.total.instructions
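
A quick consistency check on the table above: perf-stat's CPI and IPC are
reciprocals, so their deltas should mirror each other, and they do:

  IPC = 1 / CPI
  IPC_new / IPC_old = CPI_old / CPI_new = 1 / 1.032 = 0.969  (i.e. -3.1%)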


***************************************************************************************************
lkp-skl-fpga01: 104 threads 2 sockets (Skylake) with 192G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-9.4/thread/100%/debian-12-x86_64-20240206.cgz/lkp-skl-fpga01/pwrite1/will-it-scale

commit: 
  73c2437109 ("crypto: s390/sha3 - Use cpu byte-order when exporting")
  71203f68c7 ("padata: Fix pd UAF once and for all")

73c2437109c3eab2 71203f68c7749609d7fc8ae6ad0 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    997120            +1.5%    1011812        proc-vmstat.pgfree
  55606929            +1.1%   56223715        will-it-scale.104.threads
    534681            +1.1%     540612        will-it-scale.per_thread_ops
  55606929            +1.1%   56223715        will-it-scale.workload
      0.01 ± 34%     +63.9%       0.02 ± 31%  perf-sched.sch_delay.avg.ms.anon_pipe_read.fifo_pipe_read.vfs_read.ksys_read
    233.78 ±143%    +242.3%     800.28        perf-sched.wait_and_delay.avg.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
     74.68 ±  6%     +18.5%      88.49 ±  7%  perf-sched.wait_and_delay.avg.ms.anon_pipe_read.fifo_pipe_read.vfs_read.ksys_read
      4.69 ± 44%     -84.7%       0.72 ± 30%  perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
    576.00 ±  9%     -17.7%     473.83 ±  7%  perf-sched.wait_and_delay.count.anon_pipe_read.fifo_pipe_read.vfs_read.ksys_read
    999.68           -98.9%      11.47 ± 85%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
    234.63 ±142%    +240.8%     799.73        perf-sched.wait_time.avg.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
     74.67 ±  6%     +18.5%      88.46 ±  7%  perf-sched.wait_time.avg.ms.anon_pipe_read.fifo_pipe_read.vfs_read.ksys_read
      4.31 ± 48%     -91.6%       0.36 ± 29%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
    999.37           -99.3%       6.65 ± 67%  perf-sched.wait_time.max.ms.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
 1.686e+10            +1.1%  1.704e+10        perf-stat.i.branch-instructions
      1.62            +0.0        1.67        perf-stat.i.branch-miss-rate%
 2.726e+08            +4.1%  2.836e+08        perf-stat.i.branch-misses
      3.36            -1.0%       3.32        perf-stat.i.cpi
 8.562e+10            +1.1%  8.656e+10        perf-stat.i.instructions
      1.62            +0.0        1.66        perf-stat.overall.branch-miss-rate%
      3.36            -1.0%       3.33        perf-stat.overall.cpi
  1.68e+10            +1.1%  1.698e+10        perf-stat.ps.branch-instructions
 2.717e+08            +4.1%  2.827e+08        perf-stat.ps.branch-misses
 8.533e+10            +1.1%  8.627e+10        perf-stat.ps.instructions
 2.578e+13            +1.1%  2.607e+13        perf-stat.total.instructions
      4.44 ±  3%      -0.9        3.51 ±  5%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_safe_stack.__libc_pwrite
     11.68            -0.3       11.41        perf-profile.calltrace.cycles-pp.copy_folio_from_iter_atomic.generic_perform_write.shmem_file_write_iter.vfs_write.__x64_sys_pwrite64
      0.85 ±  8%      -0.2        0.70 ±  3%  perf-profile.calltrace.cycles-pp.file_remove_privs_flags.shmem_file_write_iter.vfs_write.__x64_sys_pwrite64.do_syscall_64
      0.81 ±  3%      -0.1        0.67        perf-profile.calltrace.cycles-pp.balance_dirty_pages_ratelimited_flags.generic_perform_write.shmem_file_write_iter.vfs_write.__x64_sys_pwrite64
      2.26            -0.1        2.13        perf-profile.calltrace.cycles-pp.fdget.__x64_sys_pwrite64.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_pwrite
      0.94            -0.0        0.90 ±  2%  perf-profile.calltrace.cycles-pp.noop_dirty_folio.shmem_write_end.generic_perform_write.shmem_file_write_iter.vfs_write
      1.14            +0.1        1.19 ±  2%  perf-profile.calltrace.cycles-pp.folio_mark_dirty.shmem_write_end.generic_perform_write.shmem_file_write_iter.vfs_write
      2.18 ±  2%      +0.4        2.58 ± 10%  perf-profile.calltrace.cycles-pp.filemap_get_entry.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
      2.58 ±  3%      -0.5        2.07 ±  4%  perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
      7.87            -0.5        7.38        perf-profile.children.cycles-pp.entry_SYSCALL_64
     11.90            -0.3       11.62        perf-profile.children.cycles-pp.copy_folio_from_iter_atomic
      9.13            -0.3        8.87        perf-profile.children.cycles-pp.rep_movs_alternative
      0.87 ±  8%      -0.1        0.72 ±  3%  perf-profile.children.cycles-pp.file_remove_privs_flags
      0.84 ±  3%      -0.1        0.71        perf-profile.children.cycles-pp.balance_dirty_pages_ratelimited_flags
      2.26            -0.1        2.13        perf-profile.children.cycles-pp.fdget
      1.01            -0.0        0.96        perf-profile.children.cycles-pp.noop_dirty_folio
      0.43 ±  2%      -0.0        0.39 ±  3%  perf-profile.children.cycles-pp.rcu_all_qs
      0.29 ±  3%      -0.0        0.26        perf-profile.children.cycles-pp.inode_to_bdi
      0.30            -0.0        0.27        perf-profile.children.cycles-pp.x64_sys_call
      0.35 ±  2%      -0.0        0.32        perf-profile.children.cycles-pp.rw_verify_area
      0.39 ±  3%      +0.2        0.58 ± 24%  perf-profile.children.cycles-pp.xas_load
      2.20 ±  2%      +0.4        2.61 ± 10%  perf-profile.children.cycles-pp.filemap_get_entry
      6.97            -0.5        6.48        perf-profile.self.cycles-pp.entry_SYSCALL_64
      2.14 ±  3%      -0.4        1.71 ±  2%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
      2.60            -0.3        2.28 ±  4%  perf-profile.self.cycles-pp.do_syscall_64
      8.90            -0.2        8.65        perf-profile.self.cycles-pp.rep_movs_alternative
      2.28            -0.2        2.10        perf-profile.self.cycles-pp.shmem_write_end
      0.86 ±  8%      -0.1        0.71 ±  3%  perf-profile.self.cycles-pp.file_remove_privs_flags
      2.24            -0.1        2.11        perf-profile.self.cycles-pp.fdget
      0.58 ±  4%      -0.1        0.48        perf-profile.self.cycles-pp.balance_dirty_pages_ratelimited_flags
      0.70 ±  3%      -0.1        0.62 ±  3%  perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
      0.61 ±  2%      -0.1        0.55        perf-profile.self.cycles-pp.generic_write_checks
      0.34 ±  2%      -0.0        0.31 ±  3%  perf-profile.self.cycles-pp.rcu_all_qs
      0.28            -0.0        0.25        perf-profile.self.cycles-pp.x64_sys_call
      0.25 ±  4%      -0.0        0.22        perf-profile.self.cycles-pp.inode_to_bdi
      0.24 ±  3%      -0.0        0.21        perf-profile.self.cycles-pp.rw_verify_area
      0.77 ±  2%      +0.0        0.82 ±  2%  perf-profile.self.cycles-pp.folio_mark_dirty
      0.72            +0.1        0.78 ±  2%  perf-profile.self.cycles-pp.current_time
      0.20 ±  3%      +0.2        0.39 ± 31%  perf-profile.self.cycles-pp.xas_load
      1.80 ±  2%      +0.2        2.02 ±  6%  perf-profile.self.cycles-pp.filemap_get_entry
      9.38            +0.2        9.61        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      2.18 ±  2%      +0.3        2.50 ±  5%  perf-profile.self.cycles-pp.shmem_write_begin
      2.54            +0.8        3.37 ±  4%  perf-profile.self.cycles-pp.__libc_pwrite





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

