Message-ID: <20150529184455.GA27501@gmail.com>
Date: Fri, 29 May 2015 20:44:56 +0200
From: Ingo Molnar <mingo@...nel.org>
To: Andy Lutomirski <luto@...capital.net>
Cc: Dave Hansen <dave@...1.net>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
X86 ML <x86@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Oleg Nesterov <oleg@...hat.com>,
Borislav Petkov <bp@...en8.de>, Rik van Riel <riel@...hat.com>,
Suresh Siddha <sbsiddha@...il.com>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>,
Fenghua Yu <fenghua.yu@...el.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
* Andy Lutomirski <luto@...capital.net> wrote:
> > It's not that simple, because the decision is not 'lazy versus eager', but
> > 'mixed lazy/eager versus eager-only':
> >
> > Even on modern machines, if a task is not using the FPU (it's doing integer
> > only work, with short sleeps just shuffling around requests, etc.) then
> > context switches get up to 5-10% faster with lazy FPU restores.
>
> That's only sort of true. I'd believe that a context switch between two lazy
> tasks is 5-10% faster than a context switch between two eager tasks. I bet that
> a context switch between a lazy task and an eager task is a whole lot slower
> than a context switch between two eager tasks because manipulating CR0.TS is
> incredibly slow on all modern CPUs AFAICT. It's even worse in a VM guest.
>
> In other words, with lazy restore, we save the XRSTOR(S) and possibly a
> subsequent XSAVEOPT/XSAVES, but the cost is a MOV to CR0 and possibly a CLTS,
> and the MOV to CR0 is much, much slower than even a worst-case XRSTOR(S). In
> the worst lazy-restore case, we also pay a full exception roundtrip, and
> everything pales in comparison. If we're a guest, then there's probably a
> handful of exits thrown in for good measure.
>
> For true integer-only tasks, I think we should instead convince glibc to add
> things like vzeroall in convenient places to force as much xstate as possible to
> the init state, thus speeding up the optimized save/restore variants.
>
> I think the fundamental issue here is that CPU designers care about xstate
> save/restore/optimize performance, but they don't care at all about TS
> performance, so TS manipulations are probably microcoded and serializing.
That's definitely true.
Btw., the prospect of getting rid of lazy restores was why I wrote the
FPU benchmarking code in the first place, and it gives these results on
reasonably recent Intel CPUs:
CR0 reads are reasonably fast:
[ 0.519287] x86/fpu: Cost of: CR0 read : 4 cycles
but we can cache the CR0 value anyway, so fast reads don't help us.
CR0 writes are bad:
[ 0.528643] x86/fpu: Cost of: CR0 write : 208 cycles
and we cannot cache it, so that hurts us.
and the cost of a CR0::TS fault is horrible:
[ 0.538042] x86/fpu: Cost of: CR0::TS fault : 1156 cycles
and this is hurting us too.
Since the first version I have extended the benchmark with a cache-cold column as
well - in the cache-cold case the difference is even more striking, and in many
cases context switches are cache cold.
Interestingly, this high cost of CR0-related accesses holds even on pretty old,
10+ years old x86 CPUs, per my measurements, so it's not limited to modern x86
microarchitectures.
So yes, it would be nice to standardize on synchronous context switching of all
CPU state.
Thanks,
Ingo