Message-ID: <20250314175309.2263997-2-herton@redhat.com>
Date: Fri, 14 Mar 2025 14:53:09 -0300
From: "Herton R. Krzesinski" <herton@...hat.com>
To: x86@...nel.org
Cc: tglx@...utronix.de,
mingo@...hat.com,
bp@...en8.de,
dave.hansen@...ux.intel.com,
hpa@...or.com,
linux-kernel@...r.kernel.org,
torvalds@...ux-foundation.org,
olichtne@...hat.com,
atomasov@...hat.com,
aokuliar@...hat.com
Subject: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic()
Since the user copy update series was merged upstream with commit
a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy
clarifications"), copy_user_generic() on x86_64 no longer aligns the
writes to the destination to an 8 byte boundary for the non-FSRM case.
Previously this was done through the ALIGN_DESTINATION macro used in
the now removed copy_user_generic_unrolled() function.
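As a rough C-level sketch of what aligning the destination means here
(a made-up function name and plain C, not the kernel code, which is
written in assembly and additionally has to handle user-access faults):

#include <stddef.h>

/*
 * Sketch: copy a few leading bytes so that the destination becomes
 * 8-byte aligned, then do the bulk of the copy with aligned 8-byte
 * stores, and finish with the remaining tail bytes.
 */
static void *copy_dst_aligned8(void *dst, const void *src, size_t len)
{
        unsigned char *d = dst;
        const unsigned char *s = src;

        if (len >= 8) {
                /* 0..7 head bytes until the destination is aligned */
                while ((unsigned long)d & 7) {
                        *d++ = *s++;
                        len--;
                }
                /*
                 * Aligned 8-byte destination stores; the source may
                 * still be unaligned, which x86 handles fine.
                 */
                while (len >= 8) {
                        *(unsigned long *)d = *(const unsigned long *)s;
                        d += 8;
                        s += 8;
                        len -= 8;
                }
        }
        /* remaining tail bytes */
        while (len--)
                *d++ = *s++;
        return dst;
}

The head-byte loop above is the step that ALIGN_DESTINATION used to
perform and that this patch re-adds, in assembly, to
rep_movs_alternative for the non-FSRM/ERMS case.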
It turns out that this can cause some loss of performance/throughput
in some use cases and on specific CPUs/platforms. Lately I got two
reports of performance/throughput issues after a RHEL 9 kernel pulled
in the same upstream series with the updates to the user copy
functions. Both reports consisted of running specific networking/TCP
related tests using iperf3. The first report was about Linux bridge
testing using VMs on a specific machine with an AMD CPU (EPYC 7402);
after a brief investigation it turned out that the later change in
commit ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs
without ERMS") fixed that performance issue.
However, even after that later commit/fix was applied, I got another
regression report, this time from a multistream TCP test on a 100Gbit
mlx5 NIC, also running on an AMD based platform (AMD EPYC 7302 CPU) and
again using iperf3. Since that regression showed up with the later fix
already applied, that commit alone did not tell the whole story.
So I narrowed down the second regression use case to one that runs
without traffic through a NIC, on localhost, in order to focus on CPU
usage and not be limited by other factors such as network bandwidth. I
used another system, also with an AMD CPU (AMD EPYC 7742). Basically, I
ran iperf3 in server and client mode on the same system, for example:
- Start the server binding it to CPU core/thread 19:
$ taskset -c 19 iperf3 -D -s -B 127.0.0.1 -p 12000
- Start the client always binding/running on CPU core/thread 17, using
perf to get statistics:
$ perf stat -o stat.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \
-n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 \
> stat-19.txt
The client was always run/pinned to CPU 17, but for iperf3 in server
mode I did test runs with the server pinned to CPU 19, 21 or 23, or not
pinned to any specific CPU. So it basically consisted of four runs of
the same commands, only changing the CPU to which the server is pinned
(or dropping the taskset call before the server command for the
unpinned run); a sketch of these four variants follows the lscpu output
below. The CPUs were chosen based on the NUMA node they were on; this
is the relevant output of lscpu on the system:
$ lscpu
...
Model name: AMD EPYC 7742 64-Core Processor
...
Caches (sum of all):
L1d: 2 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 32 MiB (64 instances)
L3: 256 MiB (16 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
NUMA node1 CPU(s): 2,3,10,11,18,19,26,27,34,35,42,43,50,51,58,59,66,67,74,75,82,83,90,91,98,99,106,107,114,115,122,123
NUMA node2 CPU(s): 4,5,12,13,20,21,28,29,36,37,44,45,52,53,60,61,68,69,76,77,84,85,92,93,100,101,108,109,116,117,124,125
NUMA node3 CPU(s): 6,7,14,15,22,23,30,31,38,39,46,47,54,55,62,63,70,71,78,79,86,87,94,95,102,103,110,111,118,119,126,127
...
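The four variants mentioned above could be scripted roughly like this
(an illustrative sketch only; the output file names and the sleep/pkill
handling are my own additions and were not part of the runs above):

for cpu in 19 21 23 none; do
        # start the iperf3 server daemonized, optionally pinned to a CPU
        if [ "$cpu" = none ]; then
                iperf3 -D -s -B 127.0.0.1 -p 12000
        else
                taskset -c "$cpu" iperf3 -D -s -B 127.0.0.1 -p 12000
        fi
        sleep 1
        # client always pinned to CPU 17, measured with perf stat
        perf stat -o stat-$cpu.txt taskset -c 17 iperf3 -c 127.0.0.1 \
                -b 0/1000 -V -n 50G --repeating-payload -l 16384 \
                -p 12000 --cport 12001 2>&1 > out-$cpu.txt
        # stop the daemonized server before the next variant
        pkill -f 'iperf3 -D -s'
        sleep 1
done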
So for the server runs, when picking a CPU, I chose CPUs that are not
on the same NUMA node (the client CPU, 17, is on node 0, while CPUs 19,
21 and 23 are on nodes 1, 2 and 3 respectively). The reason is that
with this setup I was able to measure relevant performance differences
when changing the alignment of the destination in copy_user_generic().
The tables below summarize one example set of results I got:
* No alignment case:
CPU                RATE           SYS (s)        TIME (s)       sender-receiver CPU
Server bind 19:    13.0Gbits/sec  28.371851000   33.233499566   86.9%-70.8%
Server bind 21:    12.9Gbits/sec  28.283381000   33.586486621   85.8%-69.9%
Server bind 23:    11.1Gbits/sec  33.660190000   39.012243176   87.7%-64.5%
Server bind none:  18.9Gbits/sec  19.215339000   22.875117865   86.0%-80.5%
* With this patch (aligning the destination write to an 8 byte boundary):
CPU                RATE           SYS (s)        TIME (s)       sender-receiver CPU
Server bind 19:    20.6Gbits/sec  15.084684000   21.009879874   75.8%-89.3%
Server bind 21:    20.2Gbits/sec  15.322023000   21.427787353   75.5%-89.5%
Server bind 23:    18.5Gbits/sec  17.520225000   23.390048552   78.1%-88.8%
Server bind none:  26.0Gbits/sec  12.617328000   16.682558421   80.1%-89.6%
So I consistently got better results when aligning the write. The
results above were run on a 6.14.0-rc6 based kernel. SYS is the sys
time and TIME the total time to run/transfer 50G of data; the last
field is the CPU usage of the sender/receiver iperf3 processes. None of
the CPUs/platforms mentioned above have the FSRM or ERMS flag set. It's
also worth noting that each pair of iperf3 runs may give slightly
different results on each run, but I always got consistently higher
results with the write alignment for this specific test of running the
processes on CPUs in different NUMA nodes.
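For reference, one way to check whether a given CPU advertises these
features is to look at the flags in /proc/cpuinfo, e.g.:
$ grep -o -E 'erms|fsrm' /proc/cpuinfo | sort -u
with empty output meaning that neither erms nor fsrm is present.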
Reported-by: Ondrej Lichtner <olichtne@...hat.com>
Signed-off-by: Herton R. Krzesinski <herton@...hat.com>
---
arch/x86/include/asm/uaccess_64.h | 2 +-
arch/x86/lib/copy_user_64.S | 23 ++++++++++++++++++++---
2 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c52f0133425b..96a0710cfacd 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -130,7 +130,7 @@ copy_user_generic(void *to, const void *from, unsigned long len)
"2:\n"
_ASM_EXTABLE_UA(1b, 2b)
:"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
- : : "memory", "rax");
+ : : "memory", "rax", "rdx", "r8");
clac();
return len;
}
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index fc9fb5d06174..4e95e9ffa37e 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -30,22 +30,39 @@
* it simpler for us, we can clobber rsi/rdi and rax freely.
*/
SYM_FUNC_START(rep_movs_alternative)
+ movl %edi,%eax
+ andl $7,%eax
+ je .Laligned
+ cmpq $8,%rcx
+ jb .Lcopy_user_short
+ movl %eax,%edx
+ sub $8,%edx
+ negl %edx
+ movq %rdx,%r8
+ jmp .Lcopy_bytes_loop
+
+.Laligned:
cmpq $64,%rcx
jae .Llarge
cmp $8,%ecx
jae .Lword
+.Lcopy_user_short:
testl %ecx,%ecx
je .Lexit
-
.Lcopy_user_tail:
+ movq %rcx,%r8
+ movl %ecx,%edx
+.Lcopy_bytes_loop:
0: movb (%rsi),%al
1: movb %al,(%rdi)
inc %rdi
inc %rsi
- dec %rcx
- jne .Lcopy_user_tail
+ dec %edx
+ jne .Lcopy_bytes_loop
+ sub %r8,%rcx
+ jne .Laligned
.Lexit:
RET
--
2.47.1