Message-Id: <20200323114207.222412-1-courbet@google.com>
Date: Mon, 23 Mar 2020 12:42:06 +0100
From: Clement Courbet <courbet@...gle.com>
To: unlisted-recipients:; (no To-header on input)
Cc: Clement Courbet <courbet@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org,
linux-kernel@...r.kernel.org, clang-built-linux@...glegroups.com
Subject: [PATCH] x86: Alias memset to __builtin_memset.
Recent compilers know the semantics of builtins (`memset`,
`memcpy`, ...) and can replace calls with inline code when they
deem it profitable. For example, `memset(p, 0, 4)` will be lowered
to a single four-byte zero store.
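
As a rough illustration of that lowering, the sketch below puts the
library call next to the single store a compiler may emit in its place.
The helper names (`zero4_memset`, `zero4_store`, `zero4_selfcheck`) are
made up for this example and are not part of the patch; whether the
compiler really emits one store depends on the compiler and its flags.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Zero four bytes through the regular library interface. */
static void zero4_memset(void *p)
{
	memset(p, 0, 4);
}

/*
 * Roughly what the compiler may emit instead: one 32-bit zero store.
 * A fixed-size memcpy from a local keeps this alignment- and
 * aliasing-safe in portable C.
 */
static void zero4_store(void *p)
{
	const uint32_t zero = 0;

	memcpy(p, &zero, sizeof(zero));
}

/* Hypothetical self-check: both helpers must leave the same bytes. */
static void zero4_selfcheck(void)
{
	unsigned char a[4] = { 1, 2, 3, 4 };
	unsigned char b[4] = { 1, 2, 3, 4 };

	zero4_memset(a);
	zero4_store(b);
	assert(memcmp(a, b, sizeof(a)) == 0);
	assert(a[0] == 0 && a[3] == 0);
}
```

The two helpers are interchangeable as far as the caller can observe,
which is what allows the compiler to pick the cheaper form.
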
With -ffreestanding (used, for example, when building with
clang), these optimizations are disabled. This means that **all**
memsets, including those with small, constant sizes, will result
in an actual call to memset.
We have identified several spots where we have high CPU usage
because of this. For example, a single one of these memsets is
responsible for about 0.3% of our total CPU usage in the kernel.
Aliasing `memset` to `__builtin_memset` allows the compiler to
perform this optimization even when -ffreestanding is used.
There is no change when -ffreestanding is not used.
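
A minimal user-space sketch of the aliasing idea, assuming a GCC- or
clang-compatible compiler that provides `__builtin_memset`. The macro
has the same shape as the one in the patch; `struct demo_stats` and the
`demo_*` helpers are hypothetical stand-ins, sized to 0x48 bytes on the
assumption that `unsigned long long` is 8 bytes wide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Route memset to the builtin, as the patch does for x86-64. */
#undef memset
#define memset(s, c, count) __builtin_memset(s, c, count)

/* Hypothetical structure: 9 x 8 bytes = 0x48 bytes on common ABIs. */
struct demo_stats {
	unsigned long long field[9];
};

static void demo_stats_reset(struct demo_stats *st)
{
	/* Constant size: the compiler is free to emit inline stores. */
	memset(st, 0, sizeof(*st));
}

static int demo_stats_is_zero(const struct demo_stats *st)
{
	size_t i;

	for (i = 0; i < 9; i++)
		if (st->field[i] != 0)
			return 0;
	return 1;
}

/* Hypothetical self-check exercising the helpers above. */
static void demo_selfcheck(void)
{
	struct demo_stats st;

	memset(&st, 0xff, sizeof(st));
	assert(!demo_stats_is_zero(&st));
	demo_stats_reset(&st);
	assert(demo_stats_is_zero(&st));
}
```

Compiling such a file with and without -ffreestanding and comparing the
generated assembly is one way to see whether the call is inlined on a
given toolchain.
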
Below is a diff (clang) for `update_sg_lb_stats()`, which
includes the aforementioned hot memset:
  memset(sgs, 0, sizeof(*sgs));

Diff:
  movq %rsi, %rbx    ~~~>  movq $0x0, 0x40(%r8)
  movq %rdi, %r15    ~~~>  movq $0x0, 0x38(%r8)
  movl $0x48, %edx   ~~~>  movq $0x0, 0x30(%r8)
  movq %r8, %rdi     ~~~>  movq $0x0, 0x28(%r8)
  xorl %esi, %esi    ~~~>  movq $0x0, 0x20(%r8)
  callq <memset>     ~~~>  movq $0x0, 0x18(%r8)
                     ~~~>  movq $0x0, 0x10(%r8)
                     ~~~>  movq $0x0, 0x8(%r8)
                     ~~~>  movq $0x0, (%r8)
In terms of code size, this shrinks the clang-built kernel a
bit (-0.022%):
440285512 vmlinux.clang.after
440383608 vmlinux.clang.before
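
The relative change can be recomputed from the two sizes listed above.
The sketch below does that arithmetic; the macro and function names are
made up for the example. With these numbers the difference is 98096
bytes, about 0.022% of the "before" size, and the "after" file is the
smaller of the two.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes in bytes, copied from the listing above. */
#define VMLINUX_BEFORE 440383608LL
#define VMLINUX_AFTER  440285512LL

/* Signed size change in bytes; negative means the image got smaller. */
static int64_t size_delta(int64_t before, int64_t after)
{
	return after - before;
}

/* Relative change, in percent of the "before" size. */
static double size_delta_percent(int64_t before, int64_t after)
{
	return 100.0 * (double)(after - before) / (double)before;
}

/* Hypothetical self-check against the figures quoted in the text. */
static void size_selfcheck(void)
{
	double pct = size_delta_percent(VMLINUX_BEFORE, VMLINUX_AFTER);

	assert(size_delta(VMLINUX_BEFORE, VMLINUX_AFTER) == -98096);
	assert(pct < -0.022 && pct > -0.023);
}
```
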
Signed-off-by: Clement Courbet <courbet@...gle.com>
---
arch/x86/include/asm/string_64.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 75314c3dbe47..7073c25aa4a3 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
void *memset(void *s, int c, size_t n);
void *__memset(void *s, int c, size_t n);
+/* Recent compilers can generate much better code for known size and/or
+ * fill values, and will fallback on `memset` if they fail.
+ * We alias `memset` to `__builtin_memset` explicitly to inform the compiler to
+ * perform this optimization even when -ffreestanding is used.
+ */
+#if (__GNUC__ >= 4)
+#define memset(s, c, count) __builtin_memset(s, c, count)
+#endif
+
#define __HAVE_ARCH_MEMSET16
static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
{
@@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
#undef memcpy
#define memcpy(dst, src, len) __memcpy(dst, src, len)
#define memmove(dst, src, len) __memmove(dst, src, len)
+#undef memset
#define memset(s, c, n) __memset(s, c, n)
#ifndef __NO_FORTIFY
--
2.25.1.696.g5e7596f4ac-goog