Message-ID: <aS5V4Xn9q32GDnnc@xsang-OptiPlex-9020>
Date: Tue, 2 Dec 2025 10:58:41 +0800
From: Oliver Sang <oliver.sang@...el.com>
To: Mateusz Guzik <mjguzik@...il.com>
CC: Linus Torvalds <torvalds@...ux-foundation.org>, <oe-lkp@...ts.linux.dev>,
	<lkp@...el.com>, <linux-kernel@...r.kernel.org>, Borislav Petkov
	<bp@...en8.de>, Sean Christopherson <seanjc@...gle.com>, Thomas Gleixner
	<tglx@...utronix.de>, <oliver.sang@...el.com>
Subject: Re: [linus:master] [x86] 284922f4c5: stress-ng.sockfd.ops_per_sec
 6.1% improvement

hi, Mateusz Guzik,

On Fri, Nov 28, 2025 at 11:11:46AM +0100, Mateusz Guzik wrote:
> On Fri, Nov 28, 2025 at 7:30 AM kernel test robot <oliver.sang@...el.com> wrote:
> >
> >
> >
> > Hello,
> >
> > kernel test robot noticed a 6.1% improvement of stress-ng.sockfd.ops_per_sec on:
> >
> >
> > commit: 284922f4c563aa3a8558a00f2a05722133237fe8 ("x86: uaccess: don't use runtime-const rewriting in modules")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> >
> > testcase: stress-ng
> > config: x86_64-rhel-9.4
> > compiler: gcc-14
> > test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
> > parameters:
> >
> >         nr_threads: 100%
> >         testtime: 60s
> >         test: sockfd
> >         cpufreq_governor: performance
> >
> >
> >
> > Details are as below:
> > -------------------------------------------------------------------------------------------------->
> >
> >
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20251128/202511281306.51105b46-lkp@intel.com
> >
> > =========================================================================================
> > compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
> >   gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-spr-r02/sockfd/stress-ng/60s
> >
> > commit:
> >   17d85f33a8 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma")
> >   284922f4c5 ("x86: uaccess: don't use runtime-const rewriting in modules")
> >
> > 17d85f33a83b84e7 284922f4c563aa3a8558a00f2a0
> > ---------------- ---------------------------
> >          %stddev     %change         %stddev
> >              \          |                \
> >   55674763            +6.1%   59075135        stress-ng.sockfd.ops
> >     927326            +6.1%     983845        stress-ng.sockfd.ops_per_sec
> >       3555 ±  3%     +10.6%       3932 ±  3%  perf-c2c.DRAM.remote
> >       4834 ±  3%     +12.0%       5415 ±  3%  perf-c2c.HITM.local
> >       2714 ±  2%     +12.5%       3054 ±  3%  perf-c2c.HITM.remote
> >       0.51            +3.9%       0.53        perf-stat.i.MPKI
> >   34903541            +5.2%   36715161        perf-stat.i.cache-misses
> >  1.072e+08            +5.8%  1.133e+08        perf-stat.i.cache-references
> >      18971            -5.5%      17932        perf-stat.i.cycles-between-cache-misses
> >       0.46 ± 30%     +13.6%       0.52        perf-stat.overall.MPKI
> >   31330827 ± 30%     +14.9%   36004895        perf-stat.ps.cache-misses
> >   96530576 ± 30%     +15.3%  1.113e+08        perf-stat.ps.cache-references
> >      48.32            -0.2       48.16        perf-profile.calltrace.cycles-pp._raw_spin_lock.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg.sock_recvmsg
> >      48.23            -0.2       48.07        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg
> >      48.34            -0.2       48.18        perf-profile.calltrace.cycles-pp.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg.sock_recvmsg.____sys_recvmsg
> >       0.56 ±  4%      +0.1        0.65 ±  9%  perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
> >       0.62 ±  3%      +0.1        0.71 ±  8%  perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sockfd
> >       0.56 ±  3%      +0.1        0.65 ±  8%  perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >      48.34            -0.2       48.18        perf-profile.children.cycles-pp.unix_del_edges
> >       0.15 ±  3%      +0.0        0.17 ±  2%  perf-profile.children.cycles-pp.__scm_recv_common
> >       0.08 ±  7%      +0.0        0.10 ±  7%  perf-profile.children.cycles-pp.lockref_put_return
> >       0.09 ±  5%      +0.0        0.11 ±  6%  perf-profile.children.cycles-pp.__fput
> >       0.35 ±  5%      +0.1        0.43 ± 12%  perf-profile.children.cycles-pp.do_open
> >       0.63 ±  3%      +0.1        0.72 ±  8%  perf-profile.children.cycles-pp.do_sys_openat2
> >       0.56 ±  3%      +0.1        0.65 ±  8%  perf-profile.children.cycles-pp.do_filp_open
> >
> 
> While this may read as suspicious, since the change is supposed to be
> a nop for the core kernel, it in fact is not: it adds:
> /* Used for modules: built-in code uses runtime constants */
> +unsigned long USER_PTR_MAX;
> +EXPORT_SYMBOL(USER_PTR_MAX);
> 
> This should probably be __ro_after_init.
> 
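For reference, the suggested annotation would look like the sketch below (hypothetical, applied to the quoted hunk: __ro_after_init moves the variable into a section the kernel write-protects once boot completes, keeping it out of the frequently-written data it currently sits next to):

```c
/* Used for modules: built-in code uses runtime constants */
unsigned long USER_PTR_MAX __ro_after_init;
EXPORT_SYMBOL(USER_PTR_MAX);
```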
> The test at hand is heavily bottlenecked on the global lock in the
> garbage collector, which is not annotated with anything.
> 
> On my kernel I see this (nm vmlinux | sort -nk 1):
> ffffffff846c0a20 b bsd_socket_locks
> ffffffff846c0e20 b bsd_socket_buckets
> ffffffff846c1620 b unix_nr_socks
> ffffffff846c1628 b gc_in_progress
> ffffffff846c1630 b unix_graph_cyclic_sccs
> ffffffff846c1638 b unix_gc_lock <--- THE LOCK
> ffffffff846c1640 b unix_vertex_unvisited_index
> ffffffff846c1648 b unix_graph_state
> ffffffff846c1660 b unix_stream_bpf_prot
> ffffffff846c1820 b unix_stream_prot_lock
> ffffffff846c1840 b unix_dgram_bpf_prot
> ffffffff846c1a00 b unix_dgram_prot_lock
> 
> Note how bsd_socket_buckets looks suspicious in its own right, but
> ignoring that bit, I'm guessing things got pushed around and that
> changed some of the cacheline bouncing.
> 
> While a full fix is beyond the scope of this patch(tm), perhaps the
> annotation below will stabilize it against random breakage. Can you
> guys bench it?

In our tests, the patch below introduces a further performance improvement.

In our original report, 284922f4c5 showed a 6.1% performance improvement
compared to its parent 17d85f33a8.
We applied your patch directly on top of 284922f4c5. As below, with
"284922f4c5 + your patch"
we now observe a 12.8% performance improvement (still compared to 17d85f33a8).

The full comparison is as below [1].

Tested-by: kernel test robot <oliver.sang@...el.com>


=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-spr-r02/sockfd/stress-ng/60s

commit:
  17d85f33a8 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma")
  284922f4c5 ("x86: uaccess: don't use runtime-const rewriting in modules")
  c4f1335ec1 <---- patch

17d85f33a83b84e7 284922f4c563aa3a8558a00f2a0 c4f1335ec1491688ec229c5cf26
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
  55674763            +6.1%   59075135           +12.8%   62793623        stress-ng.sockfd.ops
    927326            +6.1%     983845           +12.8%    1045895        stress-ng.sockfd.ops_per_sec


> 
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 78323d43e63e..25f65817faab 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -199,7 +199,7 @@ static void unix_free_vertices(struct scm_fp_list *fpl)
>         }
>  }
> 
> -static DEFINE_SPINLOCK(unix_gc_lock);
> +static __cacheline_aligned_in_smp DEFINE_SPINLOCK(unix_gc_lock);
> 
>  void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
>  {


[1]
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-spr-r02/sockfd/stress-ng/60s

commit: 
  17d85f33a8 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma")
  284922f4c5 ("x86: uaccess: don't use runtime-const rewriting in modules")
  c4f1335ec1 <---- patch

17d85f33a83b84e7 284922f4c563aa3a8558a00f2a0 c4f1335ec1491688ec229c5cf26
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
     19.11            +0.1%      19.14            +1.4%      19.38        turbostat.RAMWatt
   9751559 ±  3%      -1.6%    9595510 ±  4%     -15.3%    8261973 ±  2%  proc-vmstat.pgalloc_normal
   8538105 ±  3%      -1.2%    8435833 ±  3%     -18.8%    6932658        proc-vmstat.pgfree
      3555 ±  3%     +10.6%       3932 ±  3%     +22.8%       4366 ±  8%  perf-c2c.DRAM.remote
      4834 ±  3%     +12.0%       5415 ±  3%     +20.3%       5813 ±  8%  perf-c2c.HITM.local
      2714 ±  2%     +12.5%       3054 ±  3%     +25.0%       3393 ±  8%  perf-c2c.HITM.remote
     64099 ± 30%     +17.2%      75129 ± 11%     +62.4%     104104 ±  8%  sched_debug.cpu.nr_switches.avg
    111196 ± 27%     +18.7%     131994 ±  7%     +48.0%     164614 ±  4%  sched_debug.cpu.nr_switches.max
     18142 ± 31%     +20.8%      21917 ± 11%     +52.6%      27692 ±  9%  sched_debug.cpu.nr_switches.stddev
  16326624 ±  7%      +6.8%   17434335 ± 11%     +47.3%   24056288 ±  8%  time.involuntary_context_switches
     27.78            +5.1%      29.18           +16.7%      32.42        time.user_time
  15140259 ±  8%      +7.8%   16319396 ± 12%     +51.9%   23004276 ±  9%  time.voluntary_context_switches
  55674763            +6.1%   59075135           +12.8%   62793623        stress-ng.sockfd.ops
    927326            +6.1%     983845           +12.8%    1045895        stress-ng.sockfd.ops_per_sec
  16326624 ±  7%      +6.8%   17434335 ± 11%     +47.3%   24056288 ±  8%  stress-ng.time.involuntary_context_switches
     27.78            +5.1%      29.18           +16.7%      32.42        stress-ng.time.user_time
  15140259 ±  8%      +7.8%   16319396 ± 12%     +51.9%   23004276 ±  9%  stress-ng.time.voluntary_context_switches
      0.51            +3.9%       0.53            +6.8%       0.55        perf-stat.i.MPKI
  34903541            +5.2%   36715161           +10.8%   38686195 ±  2%  perf-stat.i.cache-misses
 1.072e+08            +5.8%  1.133e+08            +8.0%  1.157e+08        perf-stat.i.cache-references
    518753 ±  7%      +7.6%     557957 ± 11%     +49.3%     774423 ±  8%  perf-stat.i.context-switches
      9.18            -1.0%       9.09            -3.4%       8.87        perf-stat.i.cpi
     18971            -5.5%      17932           -10.2%      17042        perf-stat.i.cycles-between-cache-misses
      2.34 ±  8%      +6.6%       2.50 ± 12%     +48.6%       3.48 ±  8%  perf-stat.i.metric.K/sec
      0.46 ± 30%     +13.6%       0.52           +16.8%       0.54        perf-stat.overall.MPKI
      0.10 ± 30%     +10.3%       0.11           +13.1%       0.11        perf-stat.overall.ipc
  31330827 ± 30%     +14.9%   36004895           +21.0%   37920039 ±  2%  perf-stat.ps.cache-misses
  96530576 ± 30%     +15.3%  1.113e+08           +17.7%  1.136e+08        perf-stat.ps.cache-references
    467600 ± 31%     +17.0%     546869 ± 12%     +62.5%     759773 ±  8%  perf-stat.ps.context-switches
 6.231e+10 ± 30%     +10.4%  6.876e+10           +13.0%  7.042e+10        perf-stat.ps.instructions
 3.809e+12 ± 30%     +10.4%  4.206e+12           +13.0%  4.305e+12        perf-stat.total.instructions
     48.32            -0.2       48.16            -0.2       48.13        perf-profile.calltrace.cycles-pp._raw_spin_lock.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg.sock_recvmsg
     48.23            -0.2       48.07            -0.2       48.04        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg
     48.34            -0.2       48.18            -0.2       48.15        perf-profile.calltrace.cycles-pp.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg.sock_recvmsg.____sys_recvmsg
     49.18            -0.1       49.10            -1.7       47.47 ± 10%  perf-profile.calltrace.cycles-pp.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sockfd
     49.07            -0.1       48.99            -0.2       48.83        perf-profile.calltrace.cycles-pp.unix_stream_sendmsg.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg.do_syscall_64
     49.17            -0.1       49.09            -1.7       47.46 ± 10%  perf-profile.calltrace.cycles-pp.___sys_sendmsg.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sockfd
     49.11            -0.1       49.03            -0.2       48.88        perf-profile.calltrace.cycles-pp.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe
     48.48            -0.1       48.40            -0.3       48.20        perf-profile.calltrace.cycles-pp.unix_add_edges.unix_stream_sendmsg.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg
     48.46            -0.1       48.39            -0.3       48.18        perf-profile.calltrace.cycles-pp._raw_spin_lock.unix_add_edges.unix_stream_sendmsg.____sys_sendmsg.___sys_sendmsg
     48.36            -0.1       48.30            -0.3       48.09        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.unix_add_edges.unix_stream_sendmsg.____sys_sendmsg
      0.56 ±  4%      +0.1        0.65 ±  9%      +0.2        0.71 ± 13%  perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
      0.62 ±  3%      +0.1        0.71 ±  8%      +0.2        0.79 ± 12%  perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sockfd
      0.56 ±  3%      +0.1        0.65 ±  8%      +0.2        0.72 ± 13%  perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
     97.46            -0.2       97.30            -0.4       97.10        perf-profile.children.cycles-pp._raw_spin_lock
     48.34            -0.2       48.18            -0.2       48.15        perf-profile.children.cycles-pp.unix_del_edges
     96.94            -0.1       96.80            -0.4       96.59        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
     49.08            -0.1       49.00            -0.2       48.83        perf-profile.children.cycles-pp.unix_stream_sendmsg
     49.17            -0.1       49.09            -0.2       48.94        perf-profile.children.cycles-pp.___sys_sendmsg
     49.18            -0.1       49.10            -0.2       48.96        perf-profile.children.cycles-pp.__sys_sendmsg
     49.11            -0.1       49.03            -0.2       48.88        perf-profile.children.cycles-pp.____sys_sendmsg
     48.48            -0.1       48.40            -0.3       48.20        perf-profile.children.cycles-pp.unix_add_edges
      0.05            -0.0        0.05 ± 30%      +0.0        0.06 ±  4%  perf-profile.children.cycles-pp.arch_exit_to_user_mode_prepare
      0.05 ± 30%      -0.0        0.04 ± 46%      +0.0        0.07 ± 10%  perf-profile.children.cycles-pp.sock_def_readable
      0.06            +0.0        0.06            +0.0        0.07        perf-profile.children.cycles-pp.alloc_empty_file
      0.00            +0.0        0.00            +0.1        0.05        perf-profile.children.cycles-pp.refill_obj_stock
      0.10 ±  4%      +0.0        0.10 ±  4%      +0.0        0.12 ±  3%  perf-profile.children.cycles-pp.__kmalloc_cache_noprof
      0.16 ±  9%      +0.0        0.16 ± 15%      +0.1        0.23 ± 12%  perf-profile.children.cycles-pp.__schedule
      0.16 ±  9%      +0.0        0.16 ± 15%      +0.1        0.23 ± 12%  perf-profile.children.cycles-pp.schedule
      0.17 ±  2%      +0.0        0.18 ±  3%      +0.0        0.20        perf-profile.children.cycles-pp.scm_fp_copy
      0.21            +0.0        0.21 ±  2%      +0.0        0.24        perf-profile.children.cycles-pp.__scm_send
      0.05            +0.0        0.05 ±  9%      +0.0        0.06        perf-profile.children.cycles-pp.__cond_resched
      0.07            +0.0        0.07 ±  6%      +0.0        0.08 ±  5%  perf-profile.children.cycles-pp.copy_msghdr_from_user
      0.00            +0.0        0.00 ±331%      +0.1        0.05        perf-profile.children.cycles-pp.link_path_walk
      0.01 ±223%      +0.0        0.01 ±174%      +0.1        0.06 ± 11%  perf-profile.children.cycles-pp.pick_next_task_fair
      0.06            +0.0        0.06 ±  7%      +0.0        0.07        perf-profile.children.cycles-pp.free_uid
      0.01 ±173%      +0.0        0.02 ±118%      +0.0        0.06 ±  6%  perf-profile.children.cycles-pp.unix_scm_to_skb
      0.07 ±  6%      +0.0        0.08 ±  8%      +0.0        0.09 ±  6%  perf-profile.children.cycles-pp.__legitimize_path
      0.06 ±  6%      +0.0        0.07 ± 10%      +0.0        0.08 ±  7%  perf-profile.children.cycles-pp.terminate_walk
      0.15 ±  3%      +0.0        0.17 ±  2%      +0.0        0.18 ±  2%  perf-profile.children.cycles-pp.__scm_recv_common
      0.00            +0.0        0.01 ±173%      +0.1        0.05        perf-profile.children.cycles-pp.kmem_cache_alloc_noprof
      0.16 ±  3%      +0.0        0.17 ±  4%      +0.0        0.18 ±  2%  perf-profile.children.cycles-pp.scm_recv_unix
      0.09 ±  7%      +0.0        0.10 ±  9%      +0.0        0.11 ±  6%  perf-profile.children.cycles-pp.dput
      0.08 ±  7%      +0.0        0.10 ±  7%      +0.0        0.11 ±  5%  perf-profile.children.cycles-pp.lockref_put_return
      0.09 ±  5%      +0.0        0.11 ±  6%      +0.0        0.12 ±  6%  perf-profile.children.cycles-pp.__fput
      0.15 ±  4%      +0.0        0.16 ±  4%      +0.0        0.18 ±  4%  perf-profile.children.cycles-pp.task_work_run
      0.16 ±  5%      +0.0        0.18 ±  6%      +0.0        0.19 ±  4%  perf-profile.children.cycles-pp.close_range
      0.24 ±  3%      +0.0        0.26 ±  6%      +0.1        0.32 ±  4%  perf-profile.children.cycles-pp.exit_to_user_mode_loop
      0.19 ± 10%      +0.1        0.25 ± 18%      +0.1        0.29 ± 25%  perf-profile.children.cycles-pp.chrdev_open
      0.24 ±  8%      +0.1        0.31 ± 15%      +0.1        0.35 ± 21%  perf-profile.children.cycles-pp.do_dentry_open
      0.24 ±  7%      +0.1        0.32 ± 15%      +0.1        0.36 ± 20%  perf-profile.children.cycles-pp.vfs_open
      0.35 ±  5%      +0.1        0.43 ± 12%      +0.1        0.49 ± 15%  perf-profile.children.cycles-pp.do_open
      0.63 ±  3%      +0.1        0.72 ±  8%      +0.2        0.82 ±  9%  perf-profile.children.cycles-pp.do_sys_openat2
      0.63 ±  3%      +0.1        0.72 ±  8%      +0.2        0.82 ±  9%  perf-profile.children.cycles-pp.__x64_sys_openat
      0.56 ±  3%      +0.1        0.65 ±  9%      +0.2        0.74 ± 10%  perf-profile.children.cycles-pp.path_openat
      0.56 ±  3%      +0.1        0.65 ±  8%      +0.2        0.74 ± 10%  perf-profile.children.cycles-pp.do_filp_open
     96.49            -0.1       96.34            -0.3       96.14        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      0.06            +0.0        0.06 ±  4%      +0.0        0.07 ±  5%  perf-profile.self.cycles-pp.scm_fp_copy
      0.00 ±331%      +0.0        0.01 ±173%      +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.unix_scm_to_skb
      0.08 ±  8%      +0.0        0.10 ±  8%      +0.0        0.11 ±  6%  perf-profile.self.cycles-pp.lockref_put_return

