linux-kernel - Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs

Message-ID: <alpine.LFD.2.11.1502212328210.11588@eddie.linux-mips.org>
Date:	Sun, 22 Feb 2015 00:34:08 +0000 (GMT)
From:	"Maciej W. Rozycki" <macro@...ux-mips.org>
To:	Borislav Petkov <bp@...en8.de>
cc:	Ingo Molnar <mingo@...nel.org>,
	Andy Lutomirski <luto@...capital.net>,
	Oleg Nesterov <oleg@...hat.com>,
	Rik van Riel <riel@...hat.com>, x86@...nel.org,
	linux-kernel@...r.kernel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs

On Sat, 21 Feb 2015, Borislav Petkov wrote:

> Provided I've not made a mistake, this leads me to think that this
> simple workload and pretty much everything else uses the FPU through
> glibc which does the SSE memcpy and so on. Which basically kills the
> whole idea behind lazy FPU as practically you don't really encounter
> workloads nowadays which don't use the FPU thanks to glibc and the lazy
> strategy doesn't really bring anything.
> 
> Which would then mean, we don't really need the lazy handling as
> userspace is making it eager, so to speak, for us.

 Please correct me if I'm wrong, but it looks to me like you're confusing 
lazy FPU context allocation and lazy FPU context switching.  These build 
on the same hardware principles, but they are different concepts.

 Your "userspace is making it eager" statement in the context of glibc
using SSE for `memcpy' is certainly true for lazy FPU context allocation.
However, I wouldn't be so sure about lazy FPU context switching, and a
kernel compilation (or in fact any compilation) does not appear to be a
representative benchmark to me.  I am sure lots of software won't be
calling `memcpy' all the time, so there should be context switches
between which the FPU is not referred to at all.

 Also, does `__builtin_memcpy' expand to SSE too?  I'd expect it rather
than external `memcpy' to be used by GCC for copying fixed amounts of 
data, especially smaller ones such as when passing structures by value in 
function calls or for string operations like `strdup' or suchlike.  These 
I'd expect to be ubiquitous, whereas external `memcpy' I'd expect to be 
called from time to time only.
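
 To make the distinction concrete, here is a minimal sketch of the three
kinds of copy meant above.  The structure and the function names are made
up for illustration only, and whether any of the copies ends up as SSE
moves depends on the compiler, its version and the options used, so
nothing is claimed here about the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical structure of four longs, used only for illustration. */
struct sample {
	long a, b, c, d;
};

/* Passed and returned by value: the compiler emits the copies itself. */
struct sample bump(struct sample s)
{
	s.a += 1;
	return s;
}

/* Compile-time constant size: a candidate for inline expansion. */
void copy_fixed(struct sample *dst, const struct sample *src)
{
	__builtin_memcpy(dst, src, sizeof(*dst));
}

/* Run-time size: more likely to become a call to the external memcpy. */
void copy_variable(void *dst, const void *src, size_t n)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, n);
}
```

 Building this with something like `gcc -O2 -S' and looking for %xmm
register references in the assembly output would show which of the copies,
if any, a given toolchain turns into SSE instructions.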

 Additionally I believe long-executing FPU instructions (e.g.
transcendentals) can, with lazy switching, continue to execute in parallel
after the context has already been switched, whereas an eager FPU context
switch has to stall until the FPU instruction has completed.

 And last but not least, why does the handling of CR0.TS traps have to be
complicated?  It does not look like rocket science to me.  It should be a
mere handful of instructions, and the time required to move the two FP
contexts out from and in to the FPU respectively should dominate
processing time.  As quoted, the optimisation manual states 250 cycles
for FXSAVE and FXRSTOR combined.

 And of course you can install the right handler (i.e. FSAVE vs FXSAVE) at
bootstrap depending on processor features, so you don't have to do all the
run-time checks on every trap.  You can even optimise the FSAVE handler
away at build time if you know it won't ever be used based on the
minimal supported processor family selected.
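
 What I have in mind is roughly the following.  This is a userspace
sketch only, not kernel code: the feature flag, the MIN_CPU_HAS_FXSR
build option and all the function names are hypothetical, and the two
handlers are stand-ins that return a tag instead of executing FSAVE or
FXSAVE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bit; real code would derive it from CPUID. */
#define FEAT_FXSR 0x1u

typedef int (*fpu_switch_fn)(void *old_ctx, void *new_ctx);

#ifndef MIN_CPU_HAS_FXSR
/* Stand-in for the FSAVE/FRSTOR path; compiled out if never needed. */
static int switch_fsave(void *old_ctx, void *new_ctx)
{
	(void)old_ctx; (void)new_ctx;
	return 1;
}
#endif

/* Stand-in for the FXSAVE/FXRSTOR path. */
static int switch_fxsave(void *old_ctx, void *new_ctx)
{
	(void)old_ctx; (void)new_ctx;
	return 2;
}

static fpu_switch_fn ts_trap_handler;

/* Called once at bootstrap: the feature check happens here only. */
void fpu_handler_init(unsigned int features)
{
#ifdef MIN_CPU_HAS_FXSR
	(void)features;
	ts_trap_handler = switch_fxsave;
#else
	ts_trap_handler = (features & FEAT_FXSR) ? switch_fxsave
						 : switch_fsave;
#endif
}

/* Trap path: a single indirect call, no feature checks. */
int fpu_ts_trap(void *old_ctx, void *new_ctx)
{
	assert(ts_trap_handler != NULL);
	return ts_trap_handler(old_ctx, new_ctx);
}
```

 The point of the function pointer is only that the decision is taken
once.  Patching the trap vector or the call site at bootstrap would do
the same job without the indirect call.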

 Do you happen to know, or can you determine, how much time (in clock
cycles) a CR0.TS trap itself takes, including any time required to
preserve the execution state in the handler such as pushing/popping GPRs
to/from the stack (as opposed to processing time spent on moving the FP
contexts back and forth)?  Is there no room for improvement left there?
How many task scheduling slots, say per million, must touch the FPU for
eager FPU context switching to have an advantage over the lazy one?
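
 To show why the trap cost is the number that matters, here is a
back-of-the-envelope model.  It is a deliberate simplification and an
assumption of mine: eager switching pays the context move on every slot,
lazy switching pays the context move plus the trap overhead only on slots
that touch the FPU, and the cost of toggling CR0.TS is ignored.  Eager
then wins once the share of FPU-using slots exceeds
move / (move + trap).  The function name is made up.

```c
#include <assert.h>

/*
 * Break-even number of FPU-using scheduling slots per million.
 *
 * move_cycles: cycles to move the two FP contexts out of and in to
 *              the FPU (save plus restore)
 * trap_cycles: overhead of the CR0.TS trap itself, excluding the
 *              context move (the unknown being asked about)
 *
 * Above the returned value the simplified model favours eager
 * switching, below it lazy switching.
 */
unsigned long breakeven_per_million(unsigned long move_cycles,
				    unsigned long trap_cycles)
{
	unsigned long long total;

	total = (unsigned long long)move_cycles + trap_cycles;
	assert(total > 0);
	return (unsigned long)(move_cycles * 1000000ULL / total);
}
```

 With the 250 cycles quoted for FXSAVE and FXRSTOR combined and a purely
hypothetical trap overhead of another 250 cycles, the break-even point
would be 500000 slots per million, i.e. half of all slots would have to
touch the FPU before eager switching pays off under this model.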

  Maciej