linux-kernel - [RFC PATCH 0/11 BROKEN] move FPU context loading to userspace switch

Message-Id: <1421012793-30106-1-git-send-email-riel@redhat.com>
Date:	Sun, 11 Jan 2015 16:46:22 -0500
From:	riel@...hat.com
To:	linux-kernel@...r.kernel.org
Cc:	mingo@...hat.com, hpa@...or.com, matt.fleming@...el.com,
	bp@...e.de, oleg@...hat.com, pbonzini@...hat.com,
	tglx@...utronix.de, luto@...capital.net
Subject: [RFC PATCH 0/11 BROKEN] move FPU context loading to userspace switch

Currently the kernel will always load the FPU context, even
when switching to a kernel thread, or to an idle thread. In
the case of a task on a KVM VCPU going idle for a bit, and
waking up again later, this creates a vastly inefficient
chain of FPU context saves & loads:

1) save task FPU context, load idle task FPU context (in KVM guest)
2) trap to host
3) save VCPU guest FPU context, load VCPU userspace context (__kernel_fpu_end)
4) save VCPU userspace context, load idle thread FPU context
5) save idle thread FPU context, load VCPU userspace FPU context
6) save VCPU userspace FPU context, load guest FPU context (__kernel_fpu_begin)
7) enter guest
8) save idle task FPU context, load task FPU context (in KVM guest)

This is a total of 6 FPU saves and 6 restores, touching 4 different
FPU contexts, only one of which is ever used. The hardware optimizes
FPU save and restore pretty well, but 12 operations involving 384
bytes of data add substantial overhead. Additionally, the XSAVEOPT
optimization does not work across VMENTER / VMEXIT boundaries, so
things are slower than they would be on bare metal.
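
As a sanity check on the arithmetic, the eight steps above can be
transcribed into a small table and tallied. This is only an
illustrative sketch: the context labels are shorthand for the names
used in the list, and nothing here corresponds to kernel code.

```python
# Each tuple: (step, context saved, context loaded), transcribed from the
# eight steps listed above. None means the step does no FPU save/load.
STEPS = [
    (1, "guest task", "guest idle task"),
    (2, None, None),                      # trap to host
    (3, "VCPU guest", "VCPU userspace"),
    (4, "VCPU userspace", "host idle thread"),
    (5, "host idle thread", "VCPU userspace"),
    (6, "VCPU userspace", "VCPU guest"),
    (7, None, None),                      # enter guest
    (8, "guest idle task", "guest task"),
]


def tally(steps):
    """Count how many steps perform a save and how many perform a load."""
    saves = sum(1 for _, saved, _ in steps if saved is not None)
    loads = sum(1 for _, _, loaded in steps if loaded is not None)
    return saves, loads


saves, loads = tally(STEPS)
print(f"saves={saves} loads={loads} total={saves + loads}")
# prints: saves=6 loads=6 total=12
```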

This patch series reduces it to two saves, in steps (1) and (3), and
one load, in step (6), if the VCPU and the task inside the guest both
stay on the same CPU. The load could be optimized away in a subsequent
series, by recognizing that the emulator did not touch the in-memory
FPU state for the guest.

This could also give a small performance gain for bare metal
applications that wake up and go idle repeatedly, staying on the
same CPU.
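
To illustrate the bare metal case, here is a toy model of one CPU
where an application goes idle and wakes up again on the same CPU.
It is a sketch under stated assumptions, not the logic in the patches:
the names Cpu, Task, eager_switch, deferred_switch and return_to_user
are all hypothetical, and the model assumes that a save leaves the
registers intact and that the idle thread never touches the FPU.

```python
class Task:
    def __init__(self, name, is_user):
        self.name = name
        self.is_user = is_user  # False for kernel/idle threads


class Cpu:
    def __init__(self, running):
        self.regs_owner = running  # task whose FPU state is in the registers
        self.saves = 0
        self.loads = 0


def eager_switch(cpu, prev, nxt):
    """Current behaviour in this model: always save prev, always load nxt."""
    cpu.saves += 1
    cpu.loads += 1
    cpu.regs_owner = nxt


def deferred_switch(cpu, prev, nxt):
    """Assumed deferred behaviour: save user state on the way out, load nothing."""
    if prev.is_user and cpu.regs_owner is prev:
        cpu.saves += 1  # registers still hold prev's state after the save


def return_to_user(cpu, task):
    """Load only if the registers do not already hold this task's state."""
    if cpu.regs_owner is not task:
        cpu.loads += 1
        cpu.regs_owner = task


def wake_idle_cycle(switch):
    """app goes idle, wakes up on the same CPU, and returns to userspace."""
    app, idle = Task("app", True), Task("idle", False)
    cpu = Cpu(running=app)
    switch(cpu, app, idle)
    switch(cpu, idle, app)
    return_to_user(cpu, app)
    return cpu.saves, cpu.loads


print("eager   ", wake_idle_cycle(eager_switch))     # (2, 2)
print("deferred", wake_idle_cycle(deferred_switch))  # (1, 0)
```

In this model the deferred variant does one save and no loads for the
cycle, which matches steps (1) and (8) as seen from inside the guest.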

Where it all falls apart (probably due to a stupid mistake on my end)
is the signal handling code.

In the signal handling code, the registers (including FPU state) are
all saved to the user space stack, and on sigreturn they are loaded
back in. The signal handler setup code needs to be fixed to deal with
the other changes, but I am apparently doing that incorrectly.

I have been staring at the code for a few weeks now, and do not
appear to be any closer to figuring out what I did wrong in the last
patch of this series.

I would really appreciate it if people with better knowledge of the
signal handler and/or FPU code could take a look :)
