[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAObL_7HPssPSaMMKdc-9LFKQXUnreFxMy4g8g2b7KdiRkBPW7w@mail.gmail.com>
Date: Mon, 15 Aug 2011 15:11:40 -0400
From: Andrew Lutomirski <luto@....edu>
To: Borislav Petkov <bp@...en8.de>
Cc: melwyn lobo <linux.melwyn@...il.com>,
Denys Vlasenko <vda.linux@...glemail.com>,
Ingo Molnar <mingo@...e.hu>, linux-kernel@...r.kernel.org,
"H. Peter Anvin" <hpa@...or.com>,
Thomas Gleixner <tglx@...utronix.de>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
borislav.petkov@....com
Subject: Re: x86 memcpy performance
On Mon, Aug 15, 2011 at 2:49 PM, Borislav Petkov <bp@...en8.de> wrote:
> On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote:
>>> Or, if we want to use SSE stuff in the kernel, we might think of
>>> allocating its own FPU context(s) and handle those...
>>
>> I'm thinking of having a stack of FPU states to parallel irq stacks
>> and IST stacks.
>
> ... I'm guessing with the same nesting as hardirqs? Making FPU
> instructions usable in irq contexts too.
>
>> It gets a little hairy when code inside kernel_fpu_begin traps for a
>> non-irq non-IST reason, though.
>
> How does that happen? You're in the kernel with preemption disabled and
> TS cleared, what would cause the #NM? I think that if you need to switch
> context, you simply "push" the current FPU context, allocate a new one
> and clts as part of the FPU context switching, no?
Not #NM, but page faults can happen too (even just accessing vmalloc space).
>
>> Fortunately, those are rare and all of the EX_TABLE users could mark
>> xmm regs as clobbered (except for copy_from_user...).
>
> Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
> shows reasonable speedup there, we might need to make those work too.
I'm a little surprised that SSE beats fast string operations, but I
guess benchmarking always wins.
>
>> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
>> extra FPU state can be per-cpu and not per-task.
>
> Yep.
>
>> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>>
>> The major speedup will come from saving state in kernel_fpu_begin but
>> not restoring it until the code in entry_??.S restores registers.
>
> But you'd need to save each kernel FPU state when nesting, no?
>
Yes. But we don't nest that much, and the save/restore isn't all that
expensive. And we don't have to save/restore unless kernel entries
nest and both entries try to use kernel_fpu_begin at the same time.
This whole project may take awhile. The code in there is a
poorly-documented mess, even after Hans' cleanups. (It's a lot worse
without them, though.)
--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists