Message-ID: <20090224165144.GA25174@elte.hu>
Date: Tue, 24 Feb 2009 17:51:44 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Nick Piggin <nickpiggin@...oo.com.au>,
Salman Qazi <sqazi@...gle.com>, davem@...emloft.net,
linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>, Andi Kleen <andi@...stfloor.org>
Subject: Re: Performance regression in write() syscall
* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > No, but I think it should be in arch code, and the
> > "_nocache" suffix should just be a hint to the architecture
> > that the destination is not so likely to be used.
>
> Yes. Especially since arch code is likely to need various
> arch-specific checks anyway (like the x86 code does about
> aligning the destination).
I'm inclined to do this in two or three phases: first apply the
fix from Salman in the form below.
In practice, if we get a 4K copy request, the likelihood is high
that it is part of a larger write to a full pagecache page. The
target is very unlikely to be misaligned; the source might be, as
it comes from user-space.
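(To make the alignment point concrete: a pagecache destination is a
whole kernel page, so it is PAGE_SIZE-aligned by construction, and page
alignment trivially implies the 8-byte alignment that 64-bit movnti
stores need. A minimal user-space sketch of that invariant, assuming
PAGE_SIZE is 4096:)

    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
        /* stands in for a kmap()-ed pagecache page: */
        void *dst = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

        /* page alignment implies movnti's 8-byte alignment, so
           ALIGN_DESTINATION is dead code in a 4K-only path: */
        assert(((uintptr_t)dst % PAGE_SIZE) == 0);
        assert(((uintptr_t)dst % 8) == 0);

        free(dst);
        return 0;
    }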
This portion of __copy_user_nocache() becomes largely
unnecessary:
ENTRY(__copy_user_nocache)
	CFI_STARTPROC
	cmpl $8,%edx
	jb 20f		/* less than 8 bytes, go to byte copy loop */
	ALIGN_DESTINATION
	movl %edx,%ecx
	andl $63,%edx
	shrl $6,%ecx
	jz 17f
And the tail portion becomes unnecessary too. Together those are
over a dozen instructions, so they are probably worth optimizing
out.
But i'd rather express this in terms of a separate
__copy_user_page_nocache function and keep the generic
implementation too.
I.e. like the second patch below (not tested).
With this __copy_user_nocache() becomes unused - and once we are
happy with the performance characteristics of 4K non-temporal
copies, we can remove this more generic implementation.
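(As an aside, the same non-temporal-store technique can be prototyped
from user-space with SSE2 intrinsics, which compile down to the movntdq
and sfence instructions. A rough sketch, with a made-up function name
and, unlike the kernel code, no fault handling:)

    #include <emmintrin.h>  /* SSE2: _mm_stream_si128(), _mm_sfence() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * Copy one 4096-byte page with non-temporal stores so that the
     * destination does not displace useful cache lines. dst must be
     * 16-byte aligned (a page-aligned buffer trivially is); src may
     * be unaligned, like the user-space source discussed above.
     */
    static void copy_page_nocache(void *dst, const void *src)
    {
        __m128i *d = dst;
        const __m128i *s = src;
        int i;

        for (i = 0; i < 4096 / 16; i++)
            _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

        _mm_sfence();   /* order the non-temporal stores */
    }

    int main(void)
    {
        void *dst = aligned_alloc(4096, 4096);
        char src[4096];

        memset(src, 0x5a, sizeof(src));
        copy_page_nocache(dst, src);
        printf("last byte copied: %d\n", ((char *)dst)[4095] == 0x5a);
        free(dst);
        return 0;
    }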
Does this sound reasonable, or do you think we can be smarter
than this?
Ingo
-------------------->
From 30d697fa3a25fed809a873b17531a00282dc1234 Mon Sep 17 00:00:00 2001
From: Salman Qazi <sqazi@...gle.com>
Date: Mon, 23 Feb 2009 18:03:04 -0800
Subject: [PATCH] x86: fix performance regression in write() syscall
While the introduction of __copy_from_user_nocache() (see commit
0812a579c92fefa57506821fa08e90f47cb6dbdd) may have been an improvement
for sufficiently large writes, there is evidence that it is detrimental
for small writes. Unixbench's fstime test gives the following results
for 256-byte writes with MAX_BLOCK of 2000:
2.6.29-rc6 (5 samples, each in KB/sec):
283750, 295200, 294500, 293000, 293300

2.6.29-rc6 + this patch (5 samples, each in KB/sec):
313050, 3106750, 293350, 306300, 307900

2.6.18 (5 samples, each in KB/sec):
395700, 342000, 399100, 366050, 359850
See w_test() in src/fstime.c in unixbench version 4.1.0. Basically,
the test counts how much data we can write in this manner:
alarm(10);
while (!sigalarm) {
	for (f_blocks = 0; f_blocks < 2000; ++f_blocks) {
		write(f, buf, 256);
	}
	lseek(f, 0L, 0);
}
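(For reference, a self-contained approximation of that loop; the file
name and the KB/sec accounting are illustrative, error/EINTR handling
is omitted, and the authoritative version is w_test() in unixbench's
src/fstime.c:)

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t sigalarm;

    static void on_alarm(int sig)
    {
        (void)sig;
        sigalarm = 1;
    }

    int main(void)
    {
        char buf[256] = { 0 };
        long long bytes = 0;
        int f_blocks;
        int f = open("fstime.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);

        signal(SIGALRM, on_alarm);
        alarm(10);
        while (!sigalarm) {
            for (f_blocks = 0; f_blocks < 2000; ++f_blocks)
                bytes += write(f, buf, 256);
            lseek(f, 0L, SEEK_SET);
        }
        printf("%lld KB/sec\n", bytes / 1024 / 10);
        return 0;
    }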
Note that there are other components of the write() syscall
regression that are not addressed here.
Signed-off-by: Salman Qazi <sqazi@...gle.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@...e.hu>
---
arch/x86/include/asm/uaccess_64.h | 16 ++++++++++++++--
1 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 84210c4..987a2c1 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -192,14 +192,26 @@ static inline int __copy_from_user_nocache(void *dst, const void __user *src,
unsigned size)
{
might_sleep();
- return __copy_user_nocache(dst, src, size, 1);
+ /*
+ * In practice this limit means that large file write()s
+ * which get chunked to 4K copies get handled via
+ * non-temporal stores here. Smaller writes get handled
+ * via regular __copy_from_user():
+ */
+ if (likely(size >= PAGE_SIZE))
+ return __copy_user_nocache(dst, src, size, 1);
+ else
+ return __copy_from_user(dst, src, size);
}
static inline int __copy_from_user_inatomic_nocache(void *dst,
const void __user *src,
unsigned size)
{
- return __copy_user_nocache(dst, src, size, 0);
+ if (likely(size >= PAGE_SIZE))
+ return __copy_user_nocache(dst, src, size, 0);
+ else
+ return __copy_from_user_inatomic(dst, src, size);
}
unsigned long
From 577e10ab54dacde4c9fc2d09adfb7e817cadac83 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@...e.hu>
Date: Tue, 24 Feb 2009 17:38:12 +0100
Subject: [PATCH] x86: add __copy_user_page_nocache() optimized memcpy
Add a hardcoded 4K copy __copy_user_page_nocache() implementation,
used for pagecache copies.
Signed-off-by: Ingo Molnar <mingo@...e.hu>
---
arch/x86/include/asm/uaccess_64.h | 10 +++--
arch/x86/lib/copy_user_nocache_64.S | 84 +++++++++++++++++++++++++++++++++++
2 files changed, 90 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 987a2c1..71cbcda 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -188,6 +188,8 @@ __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)
extern long __copy_user_nocache(void *dst, const void __user *src,
unsigned size, int zerorest);
+extern long __copy_user_page_nocache(void *dst, const void __user *src);
+
static inline int __copy_from_user_nocache(void *dst, const void __user *src,
unsigned size)
{
@@ -198,8 +200,8 @@ static inline int __copy_from_user_nocache(void *dst, const void __user *src,
* non-temporal stores here. Smaller writes get handled
* via regular __copy_from_user():
*/
- if (likely(size >= PAGE_SIZE))
- return __copy_user_nocache(dst, src, size, 1);
+ if (likely(size == PAGE_SIZE))
+ return __copy_user_page_nocache(dst, src);
else
return __copy_from_user(dst, src, size);
}
@@ -208,8 +210,8 @@ static inline int __copy_from_user_inatomic_nocache(void *dst,
const void __user *src,
unsigned size)
{
- if (likely(size >= PAGE_SIZE))
- return __copy_user_nocache(dst, src, size, 0);
+ if (likely(size == PAGE_SIZE))
+ return __copy_user_page_nocache(dst, src);
else
return __copy_from_user_inatomic(dst, src, size);
}
diff --git a/arch/x86/lib/copy_user_nocache_64.S b/arch/x86/lib/copy_user_nocache_64.S
index cb0c112..387f08e 100644
--- a/arch/x86/lib/copy_user_nocache_64.S
+++ b/arch/x86/lib/copy_user_nocache_64.S
@@ -135,3 +135,87 @@ ENTRY(__copy_user_nocache)
.previous
CFI_ENDPROC
ENDPROC(__copy_user_nocache)
+
+/*
+ * __copy_user_page_nocache - uncached memory copy of a single page using
+ * non-temporal stores.
+ *
+ * Used for big 4K sized writes - where the chance of repeat access to
+ * the same data is low.
+ */
+ENTRY(__copy_user_page_nocache)
+ CFI_STARTPROC
+
+ /*
+ * We'll do 64 iterations of 64 bytes each == 4096 bytes,
+ * hardcoded:
+ */
+ movl $64, %ecx
+
+1: movq 0*8(%rsi), %r8
+2: movq 1*8(%rsi), %r9
+3: movq 2*8(%rsi), %r10
+4: movq 3*8(%rsi), %r11
+
+5: movnti %r8, 0*8(%rdi)
+6: movnti %r9, 1*8(%rdi)
+7: movnti %r10, 2*8(%rdi)
+8: movnti %r11, 3*8(%rdi)
+
+9: movq 4*8(%rsi), %r8
+10: movq 5*8(%rsi), %r9
+11: movq 6*8(%rsi), %r10
+12: movq 7*8(%rsi), %r11
+
+13: movnti %r8, 4*8(%rdi)
+14: movnti %r9, 5*8(%rdi)
+15: movnti %r10, 6*8(%rdi)
+16: movnti %r11, 7*8(%rdi)
+
+ leaq 64(%rsi),%rsi
+ leaq 64(%rdi),%rdi
+
+ decl %ecx
+ jnz 1b
+
+ sfence
+
+ ret
+
+ .section .fixup,"ax"
+30: shll $6,%ecx /* remaining 64-byte blocks -> byte count */
+ movl %ecx,%edx
+ sfence
+ jmp copy_user_handle_tail
+ .previous
+
+ .section __ex_table,"a"
+ .quad 1b,30b
+ .quad 2b,30b
+ .quad 3b,30b
+ .quad 4b,30b
+ .quad 5b,30b
+ .quad 6b,30b
+ .quad 7b,30b
+ .quad 8b,30b
+ .quad 9b,30b
+ .quad 10b,30b
+ .quad 11b,30b
+ .quad 12b,30b
+ .quad 13b,30b
+ .quad 14b,30b
+ .quad 15b,30b
+ .quad 16b,30b
+ .previous
+
+ CFI_ENDPROC
+ENDPROC(__copy_user_page_nocache)
--