Message-ID: <3175fc05afa1c2b0defd935d0fec22d5.squirrel@www.skyhub.de>
Date: Mon, 15 Aug 2011 20:49:56 +0200 (CEST)
From: "Borislav Petkov" <bp@...en8.de>
To: "Andrew Lutomirski" <luto@....edu>
Cc: "Borislav Petkov" <bp@...en8.de>,
"melwyn lobo" <linux.melwyn@...il.com>,
"Denys Vlasenko" <vda.linux@...glemail.com>,
"Ingo Molnar" <mingo@...e.hu>, linux-kernel@...r.kernel.org,
"H. Peter Anvin" <hpa@...or.com>,
"Thomas Gleixner" <tglx@...utronix.de>,
"Linus Torvalds" <torvalds@...ux-foundation.org>,
"Peter Zijlstra" <a.p.zijlstra@...llo.nl>, borislav.petkov@....com
Subject: Re: x86 memcpy performance
On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote:
>> Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
>> This would obviate the need to muck with contexts but that could get
>> expensive wrt stack operations. The advantage is that I'm not dealing
>> with the whole FPU state but only with 16 XMM regs. I should probably
>> dust off that version again and retest.
>
> I bet it won't be a significant win. On Sandy Bridge, clts/stts takes
> 80 ns and a full state save+restore is only ~60 ns.
> Without infrastructure changes, I don't think you can avoid the clts
> and stts.
Yeah, probably.
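To put the quoted numbers side by side, here is a back-of-the-envelope sketch. It assumes the 80 ns figure covers the clts+stts pair together (the thread does not say), and xmm_spill_ns is a hypothetical parameter for whatever pushing and popping the XMM regs would cost.

```c
/* Figures quoted above for Sandy Bridge. Treating 80 ns as the cost of
 * the clts+stts pair (rather than of each instruction) is an assumption. */
enum {
	CLTS_STTS_NS         = 80,
	FULL_SAVE_RESTORE_NS = 60,
};

/* Overhead of the existing path: toggle TS, save and restore everything. */
static int full_path_ns(void)
{
	return CLTS_STTS_NS + FULL_SAVE_RESTORE_NS;
}

/* Overhead of an XMM-only path. It still pays for clts/stts;
 * xmm_spill_ns is a hypothetical cost for the stack push/pop. */
static int xmm_only_path_ns(int xmm_spill_ns)
{
	return CLTS_STTS_NS + xmm_spill_ns;
}

/* What the XMM-only variant could save per call, in ns. */
static int saving_ns(int xmm_spill_ns)
{
	return full_path_ns() - xmm_only_path_ns(xmm_spill_ns);
}
```

Under these assumptions the saving can never exceed 60 ns out of 140 ns, and it shrinks to nothing as soon as spilling the XMM regs costs as much as the full save+restore.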
> You might be able to get away with turning off IRQs, reading CR0 to
> check TS, pushing XMM regs, and being very certain that you don't
> accidentally generate any VEX-coded instructions.
That's OK - I'm using movaps/movups. The problem is that I still need
to save the FPU state if the task I'm interrupting has been using FPU
instructions. So I can't get away without saving the context, in which
case I don't need to save the XMM regs separately anyway.
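For reference, a user-space sketch of the kind of copy loop meant here. This is not the kernel patch under discussion: sse_copy is a hypothetical name, and the loop relies on the compiler turning _mm_loadu_ps/_mm_storeu_ps into legacy-encoded movups, which is typical only when building without -mavx (with AVX enabled the compiler may emit VEX-coded vmovups instead).

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: copy len bytes, 16 at a time where SSE is available. */
static void *sse_copy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

#if defined(__SSE__)
	/* Unaligned 16-byte loads and stores, so no alignment requirement. */
	while (len >= 16) {
		__m128 v = _mm_loadu_ps((const float *)s);

		_mm_storeu_ps((float *)d, v);
		s += 16;
		d += 16;
		len -= 16;
	}
#endif
	/* Tail bytes, or the whole buffer on builds without SSE. */
	memcpy(d, s, len);
	return dst;
}
```

In user space the kernel handles the XMM state for us; the whole question in this thread is who saves and restores those registers when the same loop runs in kernel context.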
>> Or, if we want to use SSE stuff in the kernel, we might think of
>> allocating its own FPU context(s) and handle those...
>
> I'm thinking of having a stack of FPU states to parallel irq stacks
> and IST stacks.
... I'm guessing with the same nesting as hardirqs? That would make FPU
instructions usable in irq contexts too.
> It gets a little hairy when code inside kernel_fpu_begin traps for a
> non-irq non-IST reason, though.
How does that happen? You're in the kernel with preemption disabled and
TS cleared, what would cause the #NM? I think that if you need to switch
context, you simply "push" the current FPU context, allocate a new one
and clts as part of the FPU context switching, no?
> Fortunately, those are rare and all of the EX_TABLE users could mark
> xmm regs as clobbered (except for copy_from_user...).
Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
shows a reasonable speedup there, we might need to make those work too.
> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
> extra FPU state can be per-cpu and not per-task.
Yep.
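A minimal user-space model of such a per-cpu stack of FPU states, just to make the push/pop idea concrete. All names and the nesting limit are hypothetical; the only hard number is the 512 bytes of an FXSAVE area, and the "live" member stands in for the hardware registers.

```c
#include <assert.h>
#include <stdbool.h>

#define FPU_STATE_BYTES 512	/* size of an FXSAVE area */
#define FPU_NEST_MAX    4	/* hypothetical nesting limit */

struct fpu_state {
	unsigned char regs[FPU_STATE_BYTES];
};

/* One of these per CPU; usable only while preemption is disabled. */
struct cpu_fpu_stack {
	struct fpu_state live;			/* stand-in for the registers */
	struct fpu_state saved[FPU_NEST_MAX];	/* parked outer contexts */
	int depth;
};

/* Entering a (possibly nested) section: park the interrupted state. */
static bool fpu_stack_push(struct cpu_fpu_stack *c)
{
	if (c->depth == FPU_NEST_MAX)
		return false;	/* caller has to fall back to integer code */
	c->saved[c->depth++] = c->live;
	return true;
}

/* Leaving the section: hand the registers back to the outer context. */
static void fpu_stack_pop(struct cpu_fpu_stack *c)
{
	assert(c->depth > 0);
	c->live = c->saved[--c->depth];
}
```

Because the stack is per-cpu and not per-task, this only works as long as nothing between push and pop can migrate or be preempted, which is the point made above.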
> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>
> The major speedup will come from saving state in kernel_fpu_begin but
> not restoring it until the code in entry_??.S restores registers.
But you'd need to save each kernel FPU state when nesting, no?
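To make the question concrete, here is a toy model of the deferred restore, under the assumption that the answer is yes: the user state is saved once and restored only on the way back to user mode, while a nested section saves and restores the kernel state it interrupted. Every name here is hypothetical, plain ints stand in for the register images, and the counters exist only to show how many saves and restores happen.

```c
#include <assert.h>
#include <stdbool.h>

#define NEST_MAX 4	/* hypothetical nesting limit */

struct lazy_fpu {
	int live;			/* stand-in for the hardware registers */
	int user_state;			/* parked user-mode image */
	int kernel_saved[NEST_MAX];	/* images of interrupted kernel sections */
	int depth;
	bool user_saved;		/* user image parked, restore still owed */
	int saves, restores;
};

static void lazy_begin(struct lazy_fpu *f)
{
	if (f->depth == 0) {
		if (!f->user_saved) {	/* first section since user mode */
			f->user_state = f->live;
			f->user_saved = true;
			f->saves++;
		}
	} else {
		assert(f->depth <= NEST_MAX);
		/* Nested: the section we interrupted has live state too. */
		f->kernel_saved[f->depth - 1] = f->live;
		f->saves++;
	}
	f->depth++;
}

static void lazy_end(struct lazy_fpu *f)
{
	assert(f->depth > 0);
	f->depth--;
	if (f->depth > 0) {
		f->live = f->kernel_saved[f->depth - 1];
		f->restores++;
	}
	/* depth == 0: nothing to do, the user restore is deferred. */
}

/* Stand-in for the register-restore path in entry_??.S. */
static void lazy_exit_to_user(struct lazy_fpu *f)
{
	assert(f->depth == 0);
	if (f->user_saved) {
		f->live = f->user_state;
		f->user_saved = false;
		f->restores++;
	}
}
```

In this model, two back-to-back sections cost one save and one restore in total, which is where the speedup would come from; only the nested case pays an extra save/restore pair per level.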
>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>> 387 equivalent) could contain garbage.
>>
>> Well, do we want to use floating point instructions in the kernel?
>
> The only use I could find is in staging.
Exactly my point - I think we should do it only when it's really worth
the trouble.
--
Regards/Gruss,
Boris.