[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALCETrVMF140-XZLi9nARPeAmyi8QRRzLRm6SSpB0BkAUSGakg@mail.gmail.com>
Date: Tue, 20 Mar 2018 14:57:05 +0000
From: Andy Lutomirski <luto@...nel.org>
To: Ingo Molnar <mingo@...nel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>,
David Laight <David.Laight@...lab.com>,
Rahul Lakkireddy <rahul.lakkireddy@...lsio.com>,
"x86@...nel.org" <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"hpa@...or.com" <hpa@...or.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
"ganeshgr@...lsio.com" <ganeshgr@...lsio.com>,
"nirranjan@...lsio.com" <nirranjan@...lsio.com>,
"indranil@...lsio.com" <indranil@...lsio.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Fenghua Yu <fenghua.yu@...el.com>,
Eric Biggers <ebiggers3@...il.com>,
Rik van Riel <riel@...hat.com>
Subject: Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Tue, Mar 20, 2018 at 8:26 AM, Ingo Molnar <mingo@...nel.org> wrote:
>
> * Thomas Gleixner <tglx@...utronix.de> wrote:
>
>> > Useful also for code that needs AVX-like registers to do things like CRCs.
>>
>> x86/crypto/ has a lot of AVX optimized code.
>
> Yeah, that's true, but the crypto code is processing fundamentally bigger blocks
> of data, which amortizes the cost of using kernel_fpu_begin()/_end().
>
> kernel_fpu_begin()/_end() is a pretty heavy operation because it does a full FPU
> save/restore via the XSAVE[S] and XRSTOR[S] instructions, which can easily copy a
> thousand bytes around! So kernel_fpu_begin()/_end() is probably a non-starter for
> something small, like a single 256-bit or 512-bit word access.
>
> But there's actually a new thing in modern kernels: we got rid of (most of) lazy
> save/restore FPU code, our new x86 FPU model is very "direct" with no FPU faults
> taken normally.
>
> So assuming the target driver will only load on modern FPUs I *think* it should
> actually be possible to do something like (pseudocode):
>
> vmovdqa %ymm0, 40(%rsp)
> vmovdqa %ymm1, 80(%rsp)
>
> ...
> # use ymm0 and ymm1
> ...
>
> vmovdqa 80(%rsp), %ymm1
> vmovdqa 40(%rsp), %ymm0
>
I think this kinda sorts works, but only kinda sorta:
- I'm a bit worried about newer CPUs where, say, a 256-bit vector
operation will implicitly clear the high 768 bits of the regs. (IIRC
that's how it works for the new VEX stuff.)
- This code will cause XINUSE to be set, which is user-visible and
may have performance and correctness effects. I think the kernel may
already have some but where it occasionally sets XINUSE on its own,
and this caused problems for rr in the past. This might not be a
total showstopper, but it's odd.
I'd rather see us finally finish the work that Rik started to rework
this differently. I'd like kernel_fpu_begin() to look like:
if (test_thread_flag(TIF_NEED_FPU_RESTORE)) {
return; // we're already okay. maybe we need to check
in_interrupt() or something, though?
} else {
XSAVES/XSAVEOPT/XSAVE;
set_thread_flag(TIF_NEED_FPU_RESTORE):
}
and kernel_fpu_end() does nothing at all.
We take the full performance hit for a *single* kernel_fpu_begin() on
an otherwise short syscall or interrupt, but there's no additional
cost for more of them or for long-enough-running things that we
schedule in the middle.
As I remember, the main hangup was that this interacts a bit oddly
with PKRU, but that's manageable.
--Andy
Powered by blists - more mailing lists