lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 5 Oct 2016 18:09:50 +0200
From:   Paolo Bonzini <pbonzini@...hat.com>
To:     Andy Lutomirski <luto@...capital.net>
Cc:     Rik van Riel <riel@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        X86 ML <x86@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        Andrew Lutomirski <luto@...nel.org>, pa@...or.com,
        Borislav Petkov <bp@...e.de>
Subject: Re: [PATCH 2/9] x86/fpu: Hard-disable lazy fpu mode



On 05/10/2016 17:59, Andy Lutomirski wrote:
> I actually benchmarked the underlying instructions quite a bit on
> Intel.  (Not on AMD, but I doubt the results are very different.)
> Writes to CR0.TS are *incredibly* slow, as are device-not-available
> exceptions.  Keep in mind that, while there's a (slow) CLTS
> instruction, there is no corresponding STTS instruction, so we're left
> with a fully serializing, slowly microcoded move to CR0.  On SVM, I
> think it's worse, because IIRC SVM doesn't have fancy execution
> controls that let MOV to CR0 avoid exiting.

SVM lets you choose whether to trap on TS and MP; update_cr0_intercept
is where KVM does that (the "selective CR0 write" intercept is always
on, while the "CR0 write" intercept is toggled in that function).

> We're talking a couple
> hundred cycles best case for a TS set/clear pair, and thousands of
> cycles if we actually take a fault.
> 
> In contrast, an unconditional XSAVE + XRSTOR was considerably faster.

Did you also do a comparison against FXSAVE/FXRSTOR (on either pre- or
post-SandyBridge processors)?

But yeah, it's possible that the lack of STTS screws the whole plan,
despite the fpu.preload optimization in switch_fpu_prepare.

Paolo

> This leads to the counterintuitive result that, if we switch from task
> A to B and back and task A is heavily using the FPU, then it's faster
> to unconditoinally save and restore the full state both ways than it
> is to set and clear TS so we can avoid it.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ