Message-ID: <aS5V4Xn9q32GDnnc@xsang-OptiPlex-9020>
Date: Tue, 2 Dec 2025 10:58:41 +0800
From: Oliver Sang <oliver.sang@...el.com>
To: Mateusz Guzik <mjguzik@...il.com>
CC: Linus Torvalds <torvalds@...ux-foundation.org>, <oe-lkp@...ts.linux.dev>,
<lkp@...el.com>, <linux-kernel@...r.kernel.org>, Borislav Petkov
<bp@...en8.de>, Sean Christopherson <seanjc@...gle.com>, Thomas Gleixner
<tglx@...utronix.de>, <oliver.sang@...el.com>
Subject: Re: [linus:master] [x86] 284922f4c5: stress-ng.sockfd.ops_per_sec
6.1% improvement
hi, Mateusz Guzik,
On Fri, Nov 28, 2025 at 11:11:46AM +0100, Mateusz Guzik wrote:
> On Fri, Nov 28, 2025 at 7:30 AM kernel test robot <oliver.sang@...el.com> wrote:
> >
> >
> >
> > Hello,
> >
> > kernel test robot noticed a 6.1% improvement of stress-ng.sockfd.ops_per_sec on:
> >
> >
> > commit: 284922f4c563aa3a8558a00f2a05722133237fe8 ("x86: uaccess: don't use runtime-const rewriting in modules")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> >
> > testcase: stress-ng
> > config: x86_64-rhel-9.4
> > compiler: gcc-14
> > test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
> > parameters:
> >
> > nr_threads: 100%
> > testtime: 60s
> > test: sockfd
> > cpufreq_governor: performance
> >
> >
> >
> > Details are as below:
> > -------------------------------------------------------------------------------------------------->
> >
> >
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20251128/202511281306.51105b46-lkp@intel.com
> >
> > =========================================================================================
> > compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
> > gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-spr-r02/sockfd/stress-ng/60s
> >
> > commit:
> > 17d85f33a8 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma")
> > 284922f4c5 ("x86: uaccess: don't use runtime-const rewriting in modules")
> >
> > 17d85f33a83b84e7 284922f4c563aa3a8558a00f2a0
> > ---------------- ---------------------------
> > %stddev %change %stddev
> > \ | \
> > 55674763 +6.1% 59075135 stress-ng.sockfd.ops
> > 927326 +6.1% 983845 stress-ng.sockfd.ops_per_sec
> > 3555 ± 3% +10.6% 3932 ± 3% perf-c2c.DRAM.remote
> > 4834 ± 3% +12.0% 5415 ± 3% perf-c2c.HITM.local
> > 2714 ± 2% +12.5% 3054 ± 3% perf-c2c.HITM.remote
> > 0.51 +3.9% 0.53 perf-stat.i.MPKI
> > 34903541 +5.2% 36715161 perf-stat.i.cache-misses
> > 1.072e+08 +5.8% 1.133e+08 perf-stat.i.cache-references
> > 18971 -5.5% 17932 perf-stat.i.cycles-between-cache-misses
> > 0.46 ± 30% +13.6% 0.52 perf-stat.overall.MPKI
> > 31330827 ± 30% +14.9% 36004895 perf-stat.ps.cache-misses
> > 96530576 ± 30% +15.3% 1.113e+08 perf-stat.ps.cache-references
> > 48.32 -0.2 48.16 perf-profile.calltrace.cycles-pp._raw_spin_lock.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg.sock_recvmsg
> > 48.23 -0.2 48.07 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg
> > 48.34 -0.2 48.18 perf-profile.calltrace.cycles-pp.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg.sock_recvmsg.____sys_recvmsg
> > 0.56 ± 4% +0.1 0.65 ± 9% perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
> > 0.62 ± 3% +0.1 0.71 ± 8% perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sockfd
> > 0.56 ± 3% +0.1 0.65 ± 8% perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > 48.34 -0.2 48.18 perf-profile.children.cycles-pp.unix_del_edges
> > 0.15 ± 3% +0.0 0.17 ± 2% perf-profile.children.cycles-pp.__scm_recv_common
> > 0.08 ± 7% +0.0 0.10 ± 7% perf-profile.children.cycles-pp.lockref_put_return
> > 0.09 ± 5% +0.0 0.11 ± 6% perf-profile.children.cycles-pp.__fput
> > 0.35 ± 5% +0.1 0.43 ± 12% perf-profile.children.cycles-pp.do_open
> > 0.63 ± 3% +0.1 0.72 ± 8% perf-profile.children.cycles-pp.do_sys_openat2
> > 0.56 ± 3% +0.1 0.65 ± 8% perf-profile.children.cycles-pp.do_filp_open
> >
>
> While this may read as suspicious, since the change is supposed to be a
> nop for the core kernel, it in fact is not, as it adds:
> /* Used for modules: built-in code uses runtime constants */
> +unsigned long USER_PTR_MAX;
> +EXPORT_SYMBOL(USER_PTR_MAX);
>
> This should probably be __ro_after_init.
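
(For reference, a minimal sketch of what the suggested annotation would
look like; the exact file placement is an assumption here:

    /* Used for modules: built-in code uses runtime constants.
     * __ro_after_init: written once during early boot, then mapped
     * read-only, so the variable can no longer sit on a hot,
     * frequently-written cacheline. */
    unsigned long USER_PTR_MAX __ro_after_init;
    EXPORT_SYMBOL(USER_PTR_MAX);
)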
>
> The test at hand is heavily bottlenecked on the global lock in the
> garbage collector, which is not annotated with anything.
>
> On my kernel I see this (nm vmlinux | sort -nk 1):
> ffffffff846c0a20 b bsd_socket_locks
> ffffffff846c0e20 b bsd_socket_buckets
> ffffffff846c1620 b unix_nr_socks
> ffffffff846c1628 b gc_in_progress
> ffffffff846c1630 b unix_graph_cyclic_sccs
> ffffffff846c1638 b unix_gc_lock <--- THE LOCK
> ffffffff846c1640 b unix_vertex_unvisited_index
> ffffffff846c1648 b unix_graph_state
> ffffffff846c1660 b unix_stream_bpf_prot
> ffffffff846c1820 b unix_stream_prot_lock
> ffffffff846c1840 b unix_dgram_bpf_prot
> ffffffff846c1a00 b unix_dgram_prot_lock
>
> Note how bsd_socket_buckets looks suspicious in its own right, but
> ignoring that bit, I'm guessing things got pushed around and it changed
> some of the cacheline bouncing.
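
(As a quick sanity check, assuming 64-byte cachelines: 0xffffffff846c1638
& ~0x3f = 0xffffffff846c1600, so in the layout above unix_gc_lock shares
its line with unix_nr_socks, gc_in_progress and unix_graph_cyclic_sccs --
a write to any of those bounces the cacheline the lock lives on.)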
>
> While a full fix is beyond the scope of this patch(tm), perhaps the
> annotation below will stabilize it against random breakage. Can you
> guys bench it?
In our tests, the patch below introduces a further performance improvement.

In our original report, 284922f4c5 showed a 6.1% performance improvement
compared to its parent 17d85f33a8.

We applied your patch directly on top of 284922f4c5. As shown below, with
"284922f4c5 + your patch" we now observe a 12.8% performance improvement
(still compared to 17d85f33a8).

The full comparison is in [1] below.
Tested-by: kernel test robot <oliver.sang@...el.com>
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-spr-r02/sockfd/stress-ng/60s
commit:
17d85f33a8 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma")
284922f4c5 ("x86: uaccess: don't use runtime-const rewriting in modules")
c4f1335ec1 <---- patch
17d85f33a83b84e7 284922f4c563aa3a8558a00f2a0 c4f1335ec1491688ec229c5cf26
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
55674763 +6.1% 59075135 +12.8% 62793623 stress-ng.sockfd.ops
927326 +6.1% 983845 +12.8% 1045895 stress-ng.sockfd.ops_per_sec
>
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 78323d43e63e..25f65817faab 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -199,7 +199,7 @@ static void unix_free_vertices(struct scm_fp_list *fpl)
> }
> }
>
> -static DEFINE_SPINLOCK(unix_gc_lock);
> +static __cacheline_aligned_in_smp DEFINE_SPINLOCK(unix_gc_lock);
>
> void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver)
> {
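
(For context, a simplified sketch of what the annotation expands to on
SMP builds; the real definition in include/linux/cache.h additionally
places the variable in a dedicated section:

    /* align the object to the L1 cacheline size so it starts on its
     * own line instead of packing next to the other gc state above */
    #define __cacheline_aligned_in_smp \
            __attribute__((__aligned__(SMP_CACHE_BYTES)))
)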
[1]
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-spr-r02/sockfd/stress-ng/60s
commit:
17d85f33a8 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma")
284922f4c5 ("x86: uaccess: don't use runtime-const rewriting in modules")
c4f1335ec1 <---- patch
17d85f33a83b84e7 284922f4c563aa3a8558a00f2a0 c4f1335ec1491688ec229c5cf26
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
19.11 +0.1% 19.14 +1.4% 19.38 turbostat.RAMWatt
9751559 ± 3% -1.6% 9595510 ± 4% -15.3% 8261973 ± 2% proc-vmstat.pgalloc_normal
8538105 ± 3% -1.2% 8435833 ± 3% -18.8% 6932658 proc-vmstat.pgfree
3555 ± 3% +10.6% 3932 ± 3% +22.8% 4366 ± 8% perf-c2c.DRAM.remote
4834 ± 3% +12.0% 5415 ± 3% +20.3% 5813 ± 8% perf-c2c.HITM.local
2714 ± 2% +12.5% 3054 ± 3% +25.0% 3393 ± 8% perf-c2c.HITM.remote
64099 ± 30% +17.2% 75129 ± 11% +62.4% 104104 ± 8% sched_debug.cpu.nr_switches.avg
111196 ± 27% +18.7% 131994 ± 7% +48.0% 164614 ± 4% sched_debug.cpu.nr_switches.max
18142 ± 31% +20.8% 21917 ± 11% +52.6% 27692 ± 9% sched_debug.cpu.nr_switches.stddev
16326624 ± 7% +6.8% 17434335 ± 11% +47.3% 24056288 ± 8% time.involuntary_context_switches
27.78 +5.1% 29.18 +16.7% 32.42 time.user_time
15140259 ± 8% +7.8% 16319396 ± 12% +51.9% 23004276 ± 9% time.voluntary_context_switches
55674763 +6.1% 59075135 +12.8% 62793623 stress-ng.sockfd.ops
927326 +6.1% 983845 +12.8% 1045895 stress-ng.sockfd.ops_per_sec
16326624 ± 7% +6.8% 17434335 ± 11% +47.3% 24056288 ± 8% stress-ng.time.involuntary_context_switches
27.78 +5.1% 29.18 +16.7% 32.42 stress-ng.time.user_time
15140259 ± 8% +7.8% 16319396 ± 12% +51.9% 23004276 ± 9% stress-ng.time.voluntary_context_switches
0.51 +3.9% 0.53 +6.8% 0.55 perf-stat.i.MPKI
34903541 +5.2% 36715161 +10.8% 38686195 ± 2% perf-stat.i.cache-misses
1.072e+08 +5.8% 1.133e+08 +8.0% 1.157e+08 perf-stat.i.cache-references
518753 ± 7% +7.6% 557957 ± 11% +49.3% 774423 ± 8% perf-stat.i.context-switches
9.18 -1.0% 9.09 -3.4% 8.87 perf-stat.i.cpi
18971 -5.5% 17932 -10.2% 17042 perf-stat.i.cycles-between-cache-misses
2.34 ± 8% +6.6% 2.50 ± 12% +48.6% 3.48 ± 8% perf-stat.i.metric.K/sec
0.46 ± 30% +13.6% 0.52 +16.8% 0.54 perf-stat.overall.MPKI
0.10 ± 30% +10.3% 0.11 +13.1% 0.11 perf-stat.overall.ipc
31330827 ± 30% +14.9% 36004895 +21.0% 37920039 ± 2% perf-stat.ps.cache-misses
96530576 ± 30% +15.3% 1.113e+08 +17.7% 1.136e+08 perf-stat.ps.cache-references
467600 ± 31% +17.0% 546869 ± 12% +62.5% 759773 ± 8% perf-stat.ps.context-switches
6.231e+10 ± 30% +10.4% 6.876e+10 +13.0% 7.042e+10 perf-stat.ps.instructions
3.809e+12 ± 30% +10.4% 4.206e+12 +13.0% 4.305e+12 perf-stat.total.instructions
48.32 -0.2 48.16 -0.2 48.13 perf-profile.calltrace.cycles-pp._raw_spin_lock.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg.sock_recvmsg
48.23 -0.2 48.07 -0.2 48.04 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg
48.34 -0.2 48.18 -0.2 48.15 perf-profile.calltrace.cycles-pp.unix_del_edges.unix_stream_read_generic.unix_stream_recvmsg.sock_recvmsg.____sys_recvmsg
49.18 -0.1 49.10 -1.7 47.47 ± 10% perf-profile.calltrace.cycles-pp.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sockfd
49.07 -0.1 48.99 -0.2 48.83 perf-profile.calltrace.cycles-pp.unix_stream_sendmsg.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg.do_syscall_64
49.17 -0.1 49.09 -1.7 47.46 ± 10% perf-profile.calltrace.cycles-pp.___sys_sendmsg.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sockfd
49.11 -0.1 49.03 -0.2 48.88 perf-profile.calltrace.cycles-pp.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe
48.48 -0.1 48.40 -0.3 48.20 perf-profile.calltrace.cycles-pp.unix_add_edges.unix_stream_sendmsg.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg
48.46 -0.1 48.39 -0.3 48.18 perf-profile.calltrace.cycles-pp._raw_spin_lock.unix_add_edges.unix_stream_sendmsg.____sys_sendmsg.___sys_sendmsg
48.36 -0.1 48.30 -0.3 48.09 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.unix_add_edges.unix_stream_sendmsg.____sys_sendmsg
0.56 ± 4% +0.1 0.65 ± 9% +0.2 0.71 ± 13% perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
0.62 ± 3% +0.1 0.71 ± 8% +0.2 0.79 ± 12% perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sockfd
0.56 ± 3% +0.1 0.65 ± 8% +0.2 0.72 ± 13% perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
97.46 -0.2 97.30 -0.4 97.10 perf-profile.children.cycles-pp._raw_spin_lock
48.34 -0.2 48.18 -0.2 48.15 perf-profile.children.cycles-pp.unix_del_edges
96.94 -0.1 96.80 -0.4 96.59 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
49.08 -0.1 49.00 -0.2 48.83 perf-profile.children.cycles-pp.unix_stream_sendmsg
49.17 -0.1 49.09 -0.2 48.94 perf-profile.children.cycles-pp.___sys_sendmsg
49.18 -0.1 49.10 -0.2 48.96 perf-profile.children.cycles-pp.__sys_sendmsg
49.11 -0.1 49.03 -0.2 48.88 perf-profile.children.cycles-pp.____sys_sendmsg
48.48 -0.1 48.40 -0.3 48.20 perf-profile.children.cycles-pp.unix_add_edges
0.05 -0.0 0.05 ± 30% +0.0 0.06 ± 4% perf-profile.children.cycles-pp.arch_exit_to_user_mode_prepare
0.05 ± 30% -0.0 0.04 ± 46% +0.0 0.07 ± 10% perf-profile.children.cycles-pp.sock_def_readable
0.06 +0.0 0.06 +0.0 0.07 perf-profile.children.cycles-pp.alloc_empty_file
0.00 +0.0 0.00 +0.1 0.05 perf-profile.children.cycles-pp.refill_obj_stock
0.10 ± 4% +0.0 0.10 ± 4% +0.0 0.12 ± 3% perf-profile.children.cycles-pp.__kmalloc_cache_noprof
0.16 ± 9% +0.0 0.16 ± 15% +0.1 0.23 ± 12% perf-profile.children.cycles-pp.__schedule
0.16 ± 9% +0.0 0.16 ± 15% +0.1 0.23 ± 12% perf-profile.children.cycles-pp.schedule
0.17 ± 2% +0.0 0.18 ± 3% +0.0 0.20 perf-profile.children.cycles-pp.scm_fp_copy
0.21 +0.0 0.21 ± 2% +0.0 0.24 perf-profile.children.cycles-pp.__scm_send
0.05 +0.0 0.05 ± 9% +0.0 0.06 perf-profile.children.cycles-pp.__cond_resched
0.07 +0.0 0.07 ± 6% +0.0 0.08 ± 5% perf-profile.children.cycles-pp.copy_msghdr_from_user
0.00 +0.0 0.00 ±331% +0.1 0.05 perf-profile.children.cycles-pp.link_path_walk
0.01 ±223% +0.0 0.01 ±174% +0.1 0.06 ± 11% perf-profile.children.cycles-pp.pick_next_task_fair
0.06 +0.0 0.06 ± 7% +0.0 0.07 perf-profile.children.cycles-pp.free_uid
0.01 ±173% +0.0 0.02 ±118% +0.0 0.06 ± 6% perf-profile.children.cycles-pp.unix_scm_to_skb
0.07 ± 6% +0.0 0.08 ± 8% +0.0 0.09 ± 6% perf-profile.children.cycles-pp.__legitimize_path
0.06 ± 6% +0.0 0.07 ± 10% +0.0 0.08 ± 7% perf-profile.children.cycles-pp.terminate_walk
0.15 ± 3% +0.0 0.17 ± 2% +0.0 0.18 ± 2% perf-profile.children.cycles-pp.__scm_recv_common
0.00 +0.0 0.01 ±173% +0.1 0.05 perf-profile.children.cycles-pp.kmem_cache_alloc_noprof
0.16 ± 3% +0.0 0.17 ± 4% +0.0 0.18 ± 2% perf-profile.children.cycles-pp.scm_recv_unix
0.09 ± 7% +0.0 0.10 ± 9% +0.0 0.11 ± 6% perf-profile.children.cycles-pp.dput
0.08 ± 7% +0.0 0.10 ± 7% +0.0 0.11 ± 5% perf-profile.children.cycles-pp.lockref_put_return
0.09 ± 5% +0.0 0.11 ± 6% +0.0 0.12 ± 6% perf-profile.children.cycles-pp.__fput
0.15 ± 4% +0.0 0.16 ± 4% +0.0 0.18 ± 4% perf-profile.children.cycles-pp.task_work_run
0.16 ± 5% +0.0 0.18 ± 6% +0.0 0.19 ± 4% perf-profile.children.cycles-pp.close_range
0.24 ± 3% +0.0 0.26 ± 6% +0.1 0.32 ± 4% perf-profile.children.cycles-pp.exit_to_user_mode_loop
0.19 ± 10% +0.1 0.25 ± 18% +0.1 0.29 ± 25% perf-profile.children.cycles-pp.chrdev_open
0.24 ± 8% +0.1 0.31 ± 15% +0.1 0.35 ± 21% perf-profile.children.cycles-pp.do_dentry_open
0.24 ± 7% +0.1 0.32 ± 15% +0.1 0.36 ± 20% perf-profile.children.cycles-pp.vfs_open
0.35 ± 5% +0.1 0.43 ± 12% +0.1 0.49 ± 15% perf-profile.children.cycles-pp.do_open
0.63 ± 3% +0.1 0.72 ± 8% +0.2 0.82 ± 9% perf-profile.children.cycles-pp.do_sys_openat2
0.63 ± 3% +0.1 0.72 ± 8% +0.2 0.82 ± 9% perf-profile.children.cycles-pp.__x64_sys_openat
0.56 ± 3% +0.1 0.65 ± 9% +0.2 0.74 ± 10% perf-profile.children.cycles-pp.path_openat
0.56 ± 3% +0.1 0.65 ± 8% +0.2 0.74 ± 10% perf-profile.children.cycles-pp.do_filp_open
96.49 -0.1 96.34 -0.3 96.14 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
0.06 +0.0 0.06 ± 4% +0.0 0.07 ± 5% perf-profile.self.cycles-pp.scm_fp_copy
0.00 ±331% +0.0 0.01 ±173% +0.1 0.06 ± 6% perf-profile.self.cycles-pp.unix_scm_to_skb
0.08 ± 8% +0.0 0.10 ± 8% +0.0 0.11 ± 6% perf-profile.self.cycles-pp.lockref_put_return