Message-ID: <20180320082651.jmxvvii2xvmpyr2s@gmail.com>
Date: Tue, 20 Mar 2018 09:26:51 +0100
From: Ingo Molnar <mingo@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: David Laight <David.Laight@...LAB.COM>,
'Rahul Lakkireddy' <rahul.lakkireddy@...lsio.com>,
"x86@...nel.org" <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"hpa@...or.com" <hpa@...or.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
"ganeshgr@...lsio.com" <ganeshgr@...lsio.com>,
"nirranjan@...lsio.com" <nirranjan@...lsio.com>,
"indranil@...lsio.com" <indranil@...lsio.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Thomas Gleixner <tglx@...utronix.de>,
Fenghua Yu <fenghua.yu@...el.com>,
Eric Biggers <ebiggers3@...il.com>
Subject: Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
* Thomas Gleixner <tglx@...utronix.de> wrote:
> > Useful also for code that needs AVX-like registers to do things like CRCs.
>
> x86/crypto/ has a lot of AVX optimized code.
Yeah, that's true, but the crypto code is processing fundamentally bigger blocks
of data, which amortizes the cost of using kernel_fpu_begin()/_end().
kernel_fpu_begin()/_end() is a pretty heavy operation because it does a full FPU
save/restore via the XSAVE[S] and XRSTOR[S] instructions, which can easily copy a
thousand bytes around! So kernel_fpu_begin()/_end() is probably a non-starter for
something small, like a single 256-bit or 512-bit word access.
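
To put a number on the "thousand bytes" above, here is a minimal user-space sketch (not kernel code) that asks the CPU how large the XSAVE area is for the state components currently enabled by the OS. The helper name xsave_area_size() is made up for illustration; __get_cpuid_count() is the GCC/clang helper from <cpuid.h>. CPUID leaf 0xD, sub-leaf 0 reports that size in EBX. The exact value depends on the CPU and on which features the OS enabled, so no particular output is claimed here.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Hypothetical helper: returns the size in bytes of the XSAVE area needed
 * for the state components currently enabled in XCR0 (CPUID leaf 0xD,
 * sub-leaf 0, register EBX). Returns 0 if the leaf is not available or if
 * this is not an x86 build.
 */
static uint32_t xsave_area_size(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid_count() returns 0 when the requested leaf is unsupported */
	if (!__get_cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx))
		return 0;

	return ebx;
#else
	return 0;
#endif
}
```

Printing the result of xsave_area_size() on the target machine gives a rough upper bound on how many bytes one save (and again one restore) has to move. The legacy FXSAVE region plus the XSAVE header alone account for 576 bytes, before any AVX state is added.
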
But there's actually a new thing in modern kernels: we got rid of (most of) the
lazy FPU save/restore code, and our new x86 FPU model is very "direct", with no
FPU faults taken normally.
So assuming the target driver will only load on modern FPUs I *think* it should
actually be possible to do something like (pseudocode):
	# unaligned moves: 40(%rsp) and 80(%rsp) cannot both be 32-byte aligned
	vmovdqu %ymm0, 40(%rsp)
	vmovdqu %ymm1, 80(%rsp)
	...
	# use ymm0 and ymm1
	...
	vmovdqu 80(%rsp), %ymm1
	vmovdqu 40(%rsp), %ymm0
... without using the heavy XSAVE/XRSTOR instructions.
Note that preemption probably still needs to be disabled and possibly there are
other details as well, but there should be no 'heavy' FPU operations.
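
The spill/use/restore idea can be tried out in user space with a small sketch like the one below. It is only an illustration of the register round trip and says nothing about whether the sequence is safe in kernel context; there it would additionally have to run with preemption disabled, as noted above. The names ymm_buf and ymm_spill_roundtrip() are made up for this example, and the code assumes GCC or clang on x86-64 (inline asm plus __builtin_cpu_supports()).

```c
#include <stdint.h>
#include <string.h>

/* One YMM register's worth of bytes; a struct so it can be an asm memory operand. */
struct ymm_buf {
	uint8_t b[32];
};

/*
 * Hypothetical demo: returns 1 if ymm0 holds its original contents again
 * after being spilled, reused and restored; 0 if it does not; -1 if AVX is
 * unavailable or this is not an x86-64 build.
 */
static int ymm_spill_roundtrip(void)
{
#if defined(__x86_64__)
	struct ymm_buf original, work, spill, used, restored;
	int i;

	if (!__builtin_cpu_supports("avx"))
		return -1;

	for (i = 0; i < 32; i++) {
		original.b[i] = (uint8_t)i;	/* stand-in for the interrupted context's value */
		work.b[i] = 0xff;		/* value the "driver" wants to put in ymm0 */
	}

	__asm__ volatile(
		"vmovdqu %[orig], %%ymm0\n\t"	/* pretend ymm0 is live with someone else's data */
		"vmovdqu %%ymm0, %[spill]\n\t"	/* save ymm0 to memory */
		"vmovdqu %[work], %%ymm0\n\t"	/* use ymm0 for our own purposes */
		"vmovdqu %%ymm0, %[used]\n\t"
		"vmovdqu %[spill], %%ymm0\n\t"	/* restore ymm0 */
		"vmovdqu %%ymm0, %[rest]"	/* read it back for checking */
		: [spill] "=m"(spill), [used] "=m"(used), [rest] "=m"(restored)
		: [orig] "m"(original), [work] "m"(work)
		: "xmm0");

	return memcmp(used.b, work.b, 32) == 0 &&
	       memcmp(restored.b, original.b, 32) == 0;
#else
	return -1;
#endif
}
```

The sketch uses unaligned moves (vmovdqu) so it does not depend on how the compiler aligns the buffers. It only touches ymm0 itself; control and status state such as MXCSR is left alone.
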
I think this should still preserve all user-space FPU state and shouldn't muck up
any 'weird' user-space FPU state we might have interrupted either (such as pending
exceptions, running legacy x87 code, NaN registers or weird FPU control word
settings).
But I could be wrong, it should be checked whether this sequence is safe.
Worst-case we might have to save/restore the FPU control and tag words - but those
operations should still be much faster than a full XSAVE/XRSTOR pair.
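
As a rough user-space illustration of what saving and restoring just the control words could look like, the sketch below stores and reloads the x87 control word (FNSTCW/FLDCW) and MXCSR (STMXCSR/LDMXCSR). The struct and function names are made up for this example, and it assumes GCC or clang on x86-64. It does not cover the x87 tag word, which is only accessible as part of the larger environment saved by FNSTENV or FXSAVE.

```c
#include <stdint.h>

/* Hypothetical container for the two control registers. */
struct fpu_ctl {
	uint16_t x87_cw;	/* x87 FPU control word */
	uint32_t mxcsr;		/* SSE/AVX control and status register */
};

static void fpu_ctl_save(struct fpu_ctl *s)
{
#if defined(__x86_64__)
	__asm__ volatile("fnstcw %0" : "=m"(s->x87_cw));
	__asm__ volatile("stmxcsr %0" : "=m"(s->mxcsr));
#else
	s->x87_cw = 0;
	s->mxcsr = 0;
#endif
}

static void fpu_ctl_restore(const struct fpu_ctl *s)
{
#if defined(__x86_64__)
	__asm__ volatile("fldcw %0" : : "m"(s->x87_cw));
	__asm__ volatile("ldmxcsr %0" : : "m"(s->mxcsr));
#else
	(void)s;
#endif
}

/*
 * Saves the control words, temporarily changes them, restores them, and
 * returns 1 if the restored values match the saved ones.
 */
static int fpu_ctl_roundtrip_ok(void)
{
	struct fpu_ctl saved, scratch, now;

	fpu_ctl_save(&saved);

	scratch = saved;
	scratch.x87_cw ^= 0x0400;	/* flip one x87 rounding-control bit */
	scratch.mxcsr ^= 1u << 15;	/* flip the flush-to-zero bit */
	fpu_ctl_restore(&scratch);

	/* ... code that needs the modified settings would run here ... */

	fpu_ctl_restore(&saved);
	fpu_ctl_save(&now);

	return now.x87_cw == saved.x87_cw && now.mxcsr == saved.mxcsr;
}
```

Each of these is a single short instruction operating on 2 or 4 bytes, which is the reason to expect them to be much cheaper than a full XSAVE/XRSTOR pair; actual cost would still need measuring.
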
So I do think we could do more in this area to improve driver performance, if the
code is correct and if there are actual benchmarks showing real benefits.
Thanks,
Ingo