Date:   Fri, 15 Jun 2018 22:27:39 +0200
From:   "Jason A. Donenfeld" <Jason@...c4.com>
To:     Andrew Lutomirski <luto@...nel.org>
Cc:     dave.hansen@...ux.intel.com, riel@...riel.com,
        LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>
Subject: Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch

On Fri, Jun 15, 2018 at 8:53 PM Andy Lutomirski <luto@...nel.org> wrote:
>
> On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> <dave.hansen@...ux.intel.com> wrote:
> Even with the modified optimization, kernel_fpu_end() still needs to
> reload the state that was trashed by the kernel FPU use.  If the
> kernel is using something like AVX512 state, then kernel_fpu_end()
> will transfer an enormous amount of data no matter how clever the CPU
> is.  And I think I once measured XSAVEOPT taking a hundred cycles or
> so even when RFBM==0, so it's not exactly super fast.

Indeed, the speedup is really significant, especially in the AVX512
case. Here are some numbers from my laptop and a server, taken a few
seconds ago:

AVX2 - Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
Inside: 684617437
Outside: 547710093
Percent speedup: 24

AVX512 - Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
Inside: 634415672
Outside: 286698960
Percent speedup: 121

This is from this test -- https://xn--4db.cc/F7RF2fhv/c . There are
probably various issues with that test case, and it's possible that
other effects are at play (the AVX512 case looks particularly insane)
and make the difference _that_ drastic, but I think there's no doubt
that the optimization here is a meaningful one.
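
For reference, here is a minimal sketch of what such an inside-vs-outside
comparison could look like as a throwaway kernel module, assuming "Inside"
means a kernel_fpu_begin()/kernel_fpu_end() pair around every iteration and
"Outside" means a single pair hoisted around the whole loop. The actual test
linked above may be structured differently; the module name, iteration count,
and cycle counting below are illustrative only:

/* Hypothetical one-shot benchmark module; not the test linked above. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/fpu/api.h>	/* kernel_fpu_begin()/kernel_fpu_end() */
#include <asm/timex.h>		/* get_cycles() */

#define ITERS 100000

static int __init fpu_cost_init(void)
{
	cycles_t t0, inside, outside;
	int i;

	/* "Inside": pay for begin/end (and the XSAVE/XRSTOR they imply)
	 * on every single iteration. */
	t0 = get_cycles();
	for (i = 0; i < ITERS; i++) {
		kernel_fpu_begin();
		/* SIMD work (e.g. an AVX2/AVX512 routine) would go here. */
		kernel_fpu_end();
	}
	inside = get_cycles() - t0;

	/* "Outside": one begin/end amortized over the whole loop. */
	t0 = get_cycles();
	kernel_fpu_begin();
	for (i = 0; i < ITERS; i++) {
		/* Same SIMD work would go here. */
	}
	kernel_fpu_end();
	outside = get_cycles() - t0;

	pr_info("Inside: %llu\nOutside: %llu\nPercent speedup: %llu\n",
		(unsigned long long)inside,
		(unsigned long long)outside,
		(unsigned long long)(inside * 100 / outside - 100));
	return 0;
}

static void __exit fpu_cost_exit(void)
{
}

module_init(fpu_cost_init);
module_exit(fpu_cost_exit);
MODULE_LICENSE("GPL");

In the "Outside" variant the state save/restore cost is paid once rather than
ITERS times, which, as I understand the proposal in this thread, is roughly
the amortization a lazy kernel_fpu_end() at context switch would give even
when callers can't hoist begin/end themselves.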
