Message-Id: <38f1d91a44c243a91e441a947fed4b076dcd4ca1.1311587947.git.luto@mit.edu>
Date: Mon, 25 Jul 2011 06:05:47 -0400
From: Andy Lutomirski <luto@....EDU>
To: Ingo Molnar <mingo@...e.hu>
Cc: x86 <x86@...nel.org>, linux-kernel@...r.kernel.org,
Andy Lutomirski <luto@....edu>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Arjan van de Ven <arjan@...radead.org>,
Avi Kivity <avi@...hat.com>
Subject: [PATCH 3.1?] x86: Remove useless stts/clts pair in __switch_to
An stts/clts pair takes over 70 ns by itself on Sandy Bridge, and
when other things are going on it's apparently even worse. This
saves 10% on context switches between threads that both use extended
state.
Signed-off-by: Andy Lutomirski <luto@....edu>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Arjan van de Ven <arjan@...radead.org>
Cc: Avi Kivity <avi@...hat.com>
---
This is not as well tested as it should be (especially on 32-bit, where
I haven't actually tried compiling it), but I think this might be 3.1
material, so I want to get it out for review before it's even more
unjustifiably late :)
Arguments for inclusion in 3.1 (after a bit more testing):
- It's dead simple.
- It's a 10% context-switch speedup under the right conditions (switching
  between threads that both use extended state) [1].
- It's unlikely to slow any workload down, since it doesn't add any work
  anywhere.
Argument against:
- It's late.
[1] https://gitorious.org/linux-test-utils/linux-clock-tests/blobs/master/context_switch_latency.c
arch/x86/include/asm/i387.h | 10 ++++++++++
arch/x86/kernel/process_32.c | 10 ++++------
arch/x86/kernel/process_64.c | 7 +++----
3 files changed, 17 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index c9e09ea..9d2d08b 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -295,6 +295,16 @@ static inline void __unlazy_fpu(struct task_struct *tsk)
 		tsk->fpu_counter = 0;
 }
 
+static inline void __unlazy_fpu_clts(struct task_struct *tsk)
+{
+	if (task_thread_info(tsk)->status & TS_USEDFPU) {
+		__save_init_fpu(tsk);
+	} else {
+		tsk->fpu_counter = 0;
+		clts();
+	}
+}
+
 static inline void __clear_fpu(struct task_struct *tsk)
 {
 	if (task_thread_info(tsk)->status & TS_USEDFPU) {
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index a3d0dc5..c707741 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -304,7 +304,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 */
 	preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
 
-	__unlazy_fpu(prev_p);
+	if (preload_fpu)
+		__unlazy_fpu_clts(prev_p);
+	else
+		__unlazy_fpu(prev_p);
 
 	/* we're going to use this soon, after a few expensive things */
 	if (preload_fpu)
@@ -348,11 +351,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		     task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))
 		__switch_to_xtra(prev_p, next_p, tss);
 
-	/* If we're going to preload the fpu context, make sure clts
-	   is run while we're batching the cpu state updates. */
-	if (preload_fpu)
-		clts();
-
 	/*
 	 * Leave lazy mode, flushing any hypercalls made here.
 	 * This must be done before restoring TLS segments so
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b1f3f53..272bddd 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -419,11 +419,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	load_TLS(next, cpu);
 
 	/* Must be after DS reload */
-	__unlazy_fpu(prev_p);
-
-	/* Make sure cpu is ready for new context */
 	if (preload_fpu)
-		clts();
+		__unlazy_fpu_clts(prev_p);
+	else
+		__unlazy_fpu(prev_p);
 
 	/*
 	 * Leave lazy mode, flushing any hypercalls made here.
--
1.7.6