Message-ID: <aTnVEdv2j5ur48aF@meta.com>
Date: Wed, 10 Dec 2025 12:16:17 -0800
From: Ben Niu <BenNiu@...a.com>
To: Mark Rutland <mark.rutland@....com>
CC: <catalin.marinas@....com>, <will@...nel.org>, <tytso@....edu>,
<Jason@...c4.com>, <linux-arm-kernel@...ts.infradead.org>,
<niuben003@...il.com>, <benniu@...a.com>,
<linux-kernel@...r.kernel.org>
Subject: Withdraw [PATCH] tracing: Enable kprobe tracing for Arm64 asm
functions
On Mon, Nov 17, 2025 at 10:34:22AM +0000, Mark Rutland wrote:
> On Thu, Oct 30, 2025 at 11:07:51AM -0700, Ben Niu wrote:
> > On Thu, Oct 30, 2025 at 12:35:25PM +0000, Mark Rutland wrote:
> > > Is there something specific you want to trace, but cannot currently
> > > trace (on arm64)?
> >
> > For some reason, we only saw Arm64 Linux asm functions __arch_copy_to_user and
> > __arch_copy_from_user being hot in our workloads, not those counterpart asm
> > functions on x86, so we are trying to understand and improve performance of
> > those Arm64 asm functions.
>
> Are you sure that's not an artifact of those being out-of-line on arm64,
> but inline on x86? On x86, the out-of-line forms are only used when the
> CPU doesn't have FSRM, and when the CPU *does* have FSRM, the logic gets
> inlined. See raw_copy_from_user(), raw_copy_to_user(), and
> copy_user_generic() in arch/x86/include/asm/uaccess_64.h.
On x86, INLINE_COPY_TO_USER is not defined in the latest Linux kernel or in our
internal branch, so _copy_to_user is always an extern (out-of-line) function,
and copy_user_generic ends up inlined into it. copy_user_generic executes an
FSRM rep movs if the CPU supports it (our case); otherwise it calls
rep_movs_alternative, which issues plain movs to copy memory.

FSRM is vectorized internally by the CPU, and Arm FEAT_MOPS seems similar to
FSRM, but Neoverse V2 does not support FEAT_MOPS, so the kernel falls back to
sttr to copy memory.
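For readers following along, the x86 dispatch described above can be modelled roughly as below. This is a user-space sketch only, not kernel code: the names model_rep_movs, model_rep_movs_alternative, model_copy_user_generic and the cpu_has_fsrm flag are hypothetical, memcpy stands in for the rep movs fast path, and the real routines also report how many bytes were left uncopied, which this model omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Which path the model took; hypothetical names, not kernel symbols. */
enum copy_path {
    PATH_FSRM_REP_MOVS,        /* CPU has FSRM: a single "rep movs" does the copy */
    PATH_REP_MOVS_ALTERNATIVE  /* no FSRM: out-of-line fallback using plain movs */
};

/* Stand-in for the inlined "rep movs" fast path (memcpy keeps it portable). */
static void model_rep_movs(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Stand-in for rep_movs_alternative: a plain byte-by-byte copy loop. */
static void model_rep_movs_alternative(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (len--)
        *d++ = *s++;
}

/* Models the dispatch: the CPU feature check selects the copy path. */
static enum copy_path model_copy_user_generic(void *dst, const void *src,
                                              size_t len, bool cpu_has_fsrm)
{
    assert(dst != NULL && src != NULL);

    if (cpu_has_fsrm) {
        model_rep_movs(dst, src, len);
        return PATH_FSRM_REP_MOVS;
    }
    model_rep_movs_alternative(dst, src, len);
    return PATH_REP_MOVS_ALTERNATIVE;
}
```

Calling model_copy_user_generic(dst, src, len, true) takes the fast path and passing false takes the fallback; both produce the same bytes in dst. What matters for profiling is which function the samples land in.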
> Have you checked that inlining is not skewing your results, and
> artificially making those look hotter on arm64 by virtue of centralizing
> samples to the same IP/PC range?
As mentioned above, _copy_to_user is not inlined on x86.
> Can you share any information on those workloads? e.g. which callchains
> were hot?
Please reach out to James Greenhalgh and Chris Goodyer at Arm for more details
about those workloads, which I can't share on a public channel.

Using bpftrace, we found that most __arch_copy_to_user calls are for sizes of
2K-4K. I wrote a simple microbenchmark to compare the performance of
copy_to_user and copy_from_user on both x64 and arm64; see
https://github.com/mcfi/benchmark/tree/main/ku_copy

The collected data is below. The Arm64 ku copy speed dips relative to x64 when
the size is in the 2K-16K range.
Routine           Size  Arm64 bytes/s / x64 bytes/s
copy_to_user         8  203.4%
copy_from_user       8  248.5%
copy_to_user        16  194.6%
copy_from_user      16  240.6%
copy_to_user        32  195.0%
copy_from_user      32  236.5%
copy_to_user        64  193.2%
copy_from_user      64  240.4%
copy_to_user       128  211.0%
copy_from_user     128  273.0%
copy_to_user       256  185.3%
copy_from_user     256  229.3%
copy_to_user       512  152.9%
copy_from_user     512  182.4%
copy_to_user      1024  121.0%
copy_from_user    1024  141.9%
copy_to_user      2048   95.1%
copy_from_user    2048  108.7%
copy_to_user      4096   78.9%
copy_from_user    4096   85.7%
copy_to_user      8192   69.7%
copy_from_user    8192   73.5%
copy_to_user     16384   65.9%
copy_from_user   16384   68.8%
copy_to_user     32768  117.3%
copy_from_user   32768  118.6%
copy_to_user     65536  115.5%
copy_from_user   65536  117.4%
copy_to_user    131072  114.6%
copy_from_user  131072  113.3%
copy_to_user    262144  114.4%
copy_from_user  262144  109.3%
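To make the ratio column concrete, here is a small C sketch of how such a percentage can be computed and how the below-parity range can be read off the table. The struct, array, and function names are hypothetical; the percentages are copied from the table above, and this is not code from the linked benchmark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One table row pair: copy size and the Arm64/x64 throughput ratios in percent. */
struct ratio_row {
    size_t size;
    double to_user_pct;
    double from_user_pct;
};

/* Values transcribed from the table above, in ascending size order. */
static const struct ratio_row ku_copy_ratios[] = {
    {8, 203.4, 248.5},      {16, 194.6, 240.6},     {32, 195.0, 236.5},
    {64, 193.2, 240.4},     {128, 211.0, 273.0},    {256, 185.3, 229.3},
    {512, 152.9, 182.4},    {1024, 121.0, 141.9},   {2048, 95.1, 108.7},
    {4096, 78.9, 85.7},     {8192, 69.7, 73.5},     {16384, 65.9, 68.8},
    {32768, 117.3, 118.6},  {65536, 115.5, 117.4},  {131072, 114.6, 113.3},
    {262144, 114.4, 109.3},
};
#define KU_COPY_ROWS (sizeof(ku_copy_ratios) / sizeof(ku_copy_ratios[0]))

/* Ratio of two throughputs as a percentage; above 100 means Arm64 is faster. */
static double throughput_ratio_pct(double arm64_bps, double x64_bps)
{
    assert(x64_bps > 0.0);
    return 100.0 * arm64_bps / x64_bps;
}

/*
 * Reports the smallest (*lo) and largest (*hi) size at which Arm64 is slower
 * than x64 (ratio below 100%). Returns the number of such sizes; *lo and *hi
 * are left untouched when the count is zero.
 */
static size_t below_parity_range(bool from_user, size_t *lo, size_t *hi)
{
    size_t count = 0;

    for (size_t i = 0; i < KU_COPY_ROWS; i++) {
        double pct = from_user ? ku_copy_ratios[i].from_user_pct
                               : ku_copy_ratios[i].to_user_pct;
        if (pct >= 100.0)
            continue;
        if (count == 0)
            *lo = ku_copy_ratios[i].size;
        *hi = ku_copy_ratios[i].size;
        count++;
    }
    return count;
}
```

With the values above, copy_to_user falls below 100% from 2048 through 16384 bytes, while copy_from_user does so from 4096 through 16384 bytes.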
I sent out a patch that speeds up __arch_copy_from_user/__arch_copy_to_user.
Please see message ID 20251018052237.1368504-2-benniu@...a.com for more details.
> Mark.
In addition, I'm withdrawing this patch because we found a workaround: tracing
__arch_copy_to_user/__arch_copy_from_user through watchpoints. Message ID
aTjpzfqTEGPcird5@...a.com should have all the details. Thank you for reviewing
this patch and for all your helpful comments.
Ben