Date:	Sun, 11 Jan 2015 16:46:22 -0500
From:	riel@...hat.com
To:	linux-kernel@...r.kernel.org
Cc:	mingo@...hat.com, hpa@...or.com, matt.fleming@...el.com,
	bp@...e.de, oleg@...hat.com, pbonzini@...hat.com,
	tglx@...utronix.de, luto@...capital.net
Subject: [RFC PATCH 0/11 BROKEN] move FPU context loading to userspace switch

Currently the kernel will always load the FPU context, even
when switching to a kernel thread, or to an idle thread. In
the case of a task on a KVM VCPU going idle for a bit, and
waking up again later, this creates a vastly inefficient
chain of FPU context saves & loads:

1) save task FPU context, load idle task FPU context (in KVM guest)
2) trap to host
3) save VCPU guest FPU context, load VCPU userspace context (__kernel_fpu_end)
4) save VCPU userspace context, load idle thread FPU context
5) save idle thread FPU context, load VCPU userspace FPU context
6) save VCPU userspace FPU context, load guest FPU context (__kernel_fpu_begin)
7) enter guest
8) save idle task FPU context, load task FPU context (in KVM guest)

This is a total of 6 FPU saves and 6 restores, touching 4 different
FPU contexts, only one of which is ever used. The hardware optimizes
FPU save and restore pretty well, but 12 operations involving 384
bytes of data add substantial overhead. Additionally, the XSAVEOPT
optimization does not work across VMENTER / VMEXIT boundaries, so
things are slower than they would be on bare metal.
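
To make the tally concrete, here is a small userspace sketch (not
kernel code) that walks the chain above and counts one save and one
load per FPU-switching step. The struct names, the context labels and
count_fpu_ops() are made up for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* One FPU-switching step from the list above: the outgoing context is
 * saved and the incoming one is loaded. Labels are illustrative only. */
struct fpu_switch {
    const char *saved;
    const char *loaded;
};

/* Steps 2 and 7 (trap to host, enter guest) move no FPU state. */
static const struct fpu_switch eager_chain[] = {
    { "guest task",       "guest idle task"  },  /* step 1 */
    { "VCPU guest",       "VCPU userspace"   },  /* step 3 */
    { "VCPU userspace",   "host idle thread" },  /* step 4 */
    { "host idle thread", "VCPU userspace"   },  /* step 5 */
    { "VCPU userspace",   "VCPU guest"       },  /* step 6 */
    { "guest idle task",  "guest task"       },  /* step 8 */
};

struct fpu_ops {
    size_t saves;
    size_t loads;
};

/* Eager policy: every switch costs exactly one save and one load. */
static struct fpu_ops count_fpu_ops(const struct fpu_switch *chain, size_t n)
{
    struct fpu_ops ops = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        assert(chain[i].saved && chain[i].loaded);
        ops.saves++;
        ops.loads++;
    }
    return ops;
}
```

Calling count_fpu_ops() on eager_chain gives 6 saves and 6 loads, the
12 operations mentioned above.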

This patch series reduces the chain to two saves, in steps (1) and (3),
and one load, in step (6), if the VCPU and the task inside the guest
both stay on the same CPU. The load could be optimized away in a
subsequent series, by recognizing that the emulator did not touch the
in-memory FPU state for the guest.

This could also give a small performance gain for bare metal
applications that wake up and go idle repeatedly, staying on the
same CPU.
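
As a rough illustration of the bare metal case, the toy model below
defers the FPU load until return to userspace and skips it when the
registers still hold that task's state. This is a userspace sketch,
not the code from this series: struct task, struct cpu, switch_away(),
return_to_user() and idle_round_trip() are hypothetical names, and the
model assumes kernel and idle threads never touch the FPU registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
    const char *name;
    bool has_user_fpu;              /* false for kernel and idle threads */
};

struct cpu {
    const struct task *fpu_owner;   /* whose state is in the registers */
    unsigned int saves;
    unsigned int loads;
};

/* Context switch: save the outgoing user state, but load nothing yet. */
static void switch_away(struct cpu *c, const struct task *prev)
{
    if (prev->has_user_fpu && c->fpu_owner == prev)
        c->saves++;                 /* registers stay intact after a save */
}

/* Return to userspace: load only if the registers hold other state. */
static void return_to_user(struct cpu *c, const struct task *t)
{
    assert(t->has_user_fpu);
    if (c->fpu_owner != t) {
        c->loads++;
        c->fpu_owner = t;
    }
}

/* Assumed scenario: a user task goes idle and wakes up on the same CPU. */
static struct cpu idle_round_trip(void)
{
    static const struct task user = { "user task", true };
    static const struct task idle = { "idle thread", false };
    struct cpu c = { &user, 0, 0 };

    switch_away(&c, &user);         /* user -> idle: one save */
    switch_away(&c, &idle);         /* idle -> user: nothing to save */
    return_to_user(&c, &user);      /* registers still hold user's state */
    return c;
}
```

In this model the round trip costs one save and no loads, where
loading on every switch would cost two saves and two loads.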

Where it all falls apart (probably due to a stupid mistake on my end)
is the signal handling code.

In the signal handling code, the registers (including FPU state) are
all saved to the user space stack, and on sigreturn they are loaded
back in. The signal handler setup code needs to be fixed to deal with
the other changes, but I am apparently doing that incorrectly.

I have been staring at the code for a few weeks now, and do not
appear to be any closer to figuring out what I did wrong in the last
patch of this series.

I would really appreciate it if people with better knowledge of the
signal handler and/or FPU code could take a look :)
