Message-ID: <20180321074634.dzpyjz3ia46snodh@gmail.com>
Date:   Wed, 21 Mar 2018 08:46:34 +0100
From:   Ingo Molnar <mingo@...nel.org>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        David Laight <David.Laight@...lab.com>,
        Rahul Lakkireddy <rahul.lakkireddy@...lsio.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "ganeshgr@...lsio.com" <ganeshgr@...lsio.com>,
        "nirranjan@...lsio.com" <nirranjan@...lsio.com>,
        "indranil@...lsio.com" <indranil@...lsio.com>,
        Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <a.p.zijlstra@...llo.nl>,
        Fenghua Yu <fenghua.yu@...el.com>,
        Eric Biggers <ebiggers3@...il.com>
Subject: Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access


So I poked around a bit and I'm having second thoughts:

* Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Tue, Mar 20, 2018 at 1:26 AM, Ingo Molnar <mingo@...nel.org> wrote:
> >
> > So assuming the target driver will only load on modern FPUs I *think* it should
> > actually be possible to do something like (pseudocode):
> >
> >         vmovdqa %ymm0, 40(%rsp)
> >         vmovdqa %ymm1, 80(%rsp)
> >
> >         ...
> >         # use ymm0 and ymm1
> >         ...
> >
> >         vmovdqa 80(%rsp), %ymm1
> >         vmovdqa 40(%rsp), %ymm0
> >
> > ... without using the heavy XSAVE/XRSTOR instructions.
> 
> No. The above is buggy. It may *work*, but it won't work in the long run.
> 
> Pretty much every single vector extension has traditionally meant that
> touching "old" registers breaks any new register use. Even if you save
> the registers carefully like in your example code, it will do magic
> and random things to the "future extended" version.

This should be relatively straightforward to solve via a proper CPU features 
check: for example by only patching in the AVX routines for 'known compatible' 
fpu_kernel_xstate_size values. Future extensions of register width will extend
the XSAVE area.

It's not fool-proof: in theory there could be semantic extensions to the vector 
unit that do not increase the size of the context - but the normal pattern is to 
increase the number of XINUSE bits and bump up the maximum context area size.

If that's a worry then an even safer compatibility check would be to explicitly 
list CPU models - we do track them pretty accurately anyway these days, mostly due 
to perf PMU support defaulting to a safe but dumb variant if a CPU model is not 
specifically listed.

That method, although more maintenance-intensive, should be pretty fool-proof 
AFAICS.

> So I absolutely *refuse* to have anything to do with the vector unit.
> You can only touch it in the kernel if you own it entirely (ie that
> "kernel_fpu_begin()/_end()" thing). Anything else is absolutely
> guaranteed to cause problems down the line.
> 
> And even if you ignore that "maintenance problems down the line" issue
> ("we can fix them when they happen") I don't want to see games like
> this, because I'm pretty sure it breaks the optimized xsave by tagging
> the state as being dirty.

So I added a bit of instrumentation, and the current state of things is that on 
64-bit x86 every single task has an initialized FPU, and every task has the exact 
same, fully filled-in xfeatures (XINUSE) value:

 [root@...atea ~]# grep -h fpu /proc/*/task/*/fpu | sort | uniq -c
    504 x86/fpu: initialized                             :                    1
    504 x86/fpu: xfeatures_mask                          :                    7
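
(The /proc/*/task/*/fpu files above come from the ad-hoc instrumentation, not 
from mainline; a rough guess at its shape, using the standard per-task proc 
"ONE" hook signature and printing the two fields the grep sorts on:)

static int proc_pid_fpu(struct seq_file *m, struct pid_namespace *ns,
			struct pid *pid, struct task_struct *task)
{
	struct fpu *fpu = &task->thread.fpu;

	seq_printf(m, "x86/fpu: initialized   : %d\n", fpu->initialized);
	seq_printf(m, "x86/fpu: xfeatures_mask: %llx\n",
		   (unsigned long long)fpu->state.xsave.header.xfeatures);
	return 0;
}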

So our latest FPU model is *really* simple and user-space should not be able to 
observe any changes in the XINUSE bits of the XSAVE header, because (at least for 
the basic vector CPU features) all bits are maxed out all the time.
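
For completeness: on CPUs with the XGETBV1 extension (CPUID.(EAX=0Dh,ECX=1):EAX 
bit 2), user-space can read the XINUSE bitmap directly, so this is easy to 
double-check from a small test program:

#include <stdint.h>
#include <stdio.h>

/* Returns XCR0 & XINUSE; faults if the XGETBV1 extension is missing. */
static uint64_t xgetbv1(void)
{
	uint32_t eax, edx;

	__asm__ volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (1));
	return ((uint64_t)edx << 32) | eax;
}

int main(void)
{
	printf("XINUSE: %#llx\n", (unsigned long long)xgetbv1());
	return 0;
}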

Note that this is with an AVX (256-bit) supporting CPU:

[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.

But note that it probably wouldn't make sense to use the XINUSE optimizations 
for the AVX state on most systems, as glibc will use the widest available vector 
ops for its regular memcpy(), and every user task uses memcpy().

It does make sense for some of the more optional XSAVE-based features such as 
pkeys. But I don't have any newer Intel system with a wider XSAVE feature set to 
check.

> So no. Don't use vector stuff in the kernel. It's not worth the pain.

That might still be true, but I'm torn:

 - Broad areas of user-space have seamlessly integrated vector ops and use 
   them whenever they can find an excuse to.

 - The vector registers are fundamentally caller-saved in the x86-64 ABI, so at 
   most synchronous call boundaries the vector registers hold no live state. 
   Asynchronous interruptions of context (interrupts, faults, preemption, etc.) 
   can still use them as well, as long as they save/restore the register contents.

So other than Intel not making it particularly easy to construct a 
forwards-compatible, register-granular vector save/restore pattern for 
asynchronous contexts (but see above for how we could handle that), I don't see 
too many other complications.
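
To make the vmovdqa pseudocode quoted above concrete, such a register-granular 
save/restore for a short asynchronous section might look like the sketch below 
(kernel context assumed for u8/__aligned; vmovdqa requires the 32-byte 
alignment, and this would of course only be patched in behind the compatibility 
checks discussed earlier):

struct ymm_scratch {
	u8 ymm0[32];
	u8 ymm1[32];
} __aligned(32);

/*
 * No clobber list is needed for %ymm0/%ymm1: the kernel is built with
 * -mno-sse/-mno-avx, so the compiler never keeps values in these registers.
 */
static inline void ymm01_save(struct ymm_scratch *s)
{
	asm volatile("vmovdqa %%ymm0, %0\n\t"
		     "vmovdqa %%ymm1, %1"
		     : "=m" (s->ymm0), "=m" (s->ymm1));
}

static inline void ymm01_restore(const struct ymm_scratch *s)
{
	asm volatile("vmovdqa %0, %%ymm0\n\t"
		     "vmovdqa %1, %%ymm1"
		     : : "m" (s->ymm0), "m" (s->ymm1));
}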

Thanks,

	Ingo
