Date:	Mon, 15 Aug 2011 15:11:40 -0400
From:	Andrew Lutomirski <luto@....edu>
To:	Borislav Petkov <bp@...en8.de>
Cc:	melwyn lobo <linux.melwyn@...il.com>,
	Denys Vlasenko <vda.linux@...glemail.com>,
	Ingo Molnar <mingo@...e.hu>, linux-kernel@...r.kernel.org,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	borislav.petkov@....com
Subject: Re: x86 memcpy performance

On Mon, Aug 15, 2011 at 2:49 PM, Borislav Petkov <bp@...en8.de> wrote:
> On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote:
>>> Or, if we want to use SSE stuff in the kernel, we might think of
>>> allocating its own FPU context(s) and handle those...
>>
>> I'm thinking of having a stack of FPU states to parallel irq stacks
>> and IST stacks.
>
> ... I'm guessing with the same nesting as hardirqs? Making FPU
> instructions usable in irq contexts too.
>
>> It gets a little hairy when code inside kernel_fpu_begin traps for a
>> non-irq non-IST reason, though.
>
> How does that happen? You're in the kernel with preemption disabled and
> TS cleared, what would cause the #NM? I think that if you need to switch
> context, you simply "push" the current FPU context, allocate a new one
> and clts as part of the FPU context switching, no?

Not #NM, but page faults can happen too (even just accessing vmalloc space).
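For illustration, the nesting described above might look something like
the toy userspace model below.  It is not kernel code: the depth
constant, the function names, and the dummy save/restore bodies are all
made up here, and the real save areas would be fxsave/xsave buffers in
per-cpu data rather than plain structs.

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Model of a per-cpu stack of FPU save areas paralleling irq/IST
 * nesting.  Dummy struct stands in for a real fxsave/xsave area. */
#define FPU_STACK_DEPTH 4                 /* enough for irq/IST nesting */

struct fpu_state { unsigned char buf[512]; };

static struct fpu_state fpu_stack[FPU_STACK_DEPTH];
static int fpu_depth;                     /* per-cpu in a real kernel */

static void save_fpu(struct fpu_state *s)          { memset(s->buf, 0, sizeof s->buf); }
static void restore_fpu(const struct fpu_state *s) { (void)s; }

static void model_kernel_fpu_begin(void)
{
	assert(fpu_depth < FPU_STACK_DEPTH);
	/* "push" whatever context is live before clobbering xmm regs */
	save_fpu(&fpu_stack[fpu_depth++]);
	/* a real kernel_fpu_begin() would also clts() here */
}

static void model_kernel_fpu_end(void)
{
	assert(fpu_depth > 0);
	/* "pop": put back the state of whoever was interrupted */
	restore_fpu(&fpu_stack[--fpu_depth]);
}

int main(void)
{
	model_kernel_fpu_begin();         /* e.g. SSE memcpy in process context */
	model_kernel_fpu_begin();         /* e.g. an irq nests and also uses SSE */
	model_kernel_fpu_end();
	model_kernel_fpu_end();
	printf("depth back to %d\n", fpu_depth);
	return 0;
}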

>
>> Fortunately, those are rare and all of the EX_TABLE users could mark
>> xmm regs as clobbered (except for copy_from_user...).
>
> Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
> shows reasonable speedup there, we might need to make those work too.

I'm a little surprised that SSE beats fast string operations, but I
guess benchmarking always wins.
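For reference, the kind of SSE copy being benchmarked is essentially a
16-byte-at-a-time load/store loop.  The snippet below is only an
illustrative userspace version using SSE2 intrinsics, not the proposed
kernel routine, and the function name is made up:

#include <emmintrin.h>    /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Copy 16 bytes per iteration with unaligned SSE2 loads/stores,
 * falling back to memcpy for the tail. */
static void sse_copy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_storeu_si128((__m128i *)d, v);
		s += 16;
		d += 16;
		len -= 16;
	}
	if (len)
		memcpy(d, s, len);        /* tail */
}

int main(void)
{
	char src[100], dst[100];
	for (int i = 0; i < 100; i++)
		src[i] = (char)i;
	sse_copy(dst, src, sizeof src);
	return memcmp(dst, src, sizeof src) != 0;
}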

>
>> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
>> extra FPU state can be per-cpu and not per-task.
>
> Yep.
>
>> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>>
>> The major speedup will come from saving state in kernel_fpu_begin but
>> not restoring it until the code in entry_??.S restores registers.
>
> But you'd need to save each kernel FPU state when nesting, no?
>

Yes.  But we don't nest that much, and the save/restore isn't all that
expensive.  And we don't have to save/restore unless kernel entries
nest and both entries try to use kernel_fpu_begin at the same time.
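Schematically, the deferred-restore idea might look like the model
below.  Again this is only a toy with made-up names: in reality the
"needs restore" bit would hang off the task/thread flags and the single
restore would happen in the entry_??.S exit path.

#include <stdbool.h>
#include <stdio.h>

/* Model of "save in kernel_fpu_begin(), restore only on kernel exit". */
struct fpu_state { unsigned char buf[512]; };

static struct fpu_state user_fpu;          /* per-task in a real kernel */
static bool user_fpu_needs_restore;

static void save_fpu(struct fpu_state *s)          { (void)s; /* fxsave */ }
static void restore_fpu(const struct fpu_state *s) { (void)s; /* fxrstor */ }

static void model_kernel_fpu_begin(void)
{
	if (!user_fpu_needs_restore) {
		save_fpu(&user_fpu);       /* save the user's registers once */
		user_fpu_needs_restore = true;
	}
	/* no restore on end: kernel_fpu_end() becomes nearly free */
}

static void model_kernel_fpu_end(void)
{
	/* intentionally empty in the common, non-nested case */
}

static void model_return_to_user(void)
{
	if (user_fpu_needs_restore) {
		restore_fpu(&user_fpu);    /* one restore, in the exit path */
		user_fpu_needs_restore = false;
	}
}

int main(void)
{
	model_kernel_fpu_begin();          /* first SSE user in this entry */
	model_kernel_fpu_end();
	model_kernel_fpu_begin();          /* second use: no extra save */
	model_kernel_fpu_end();
	model_return_to_user();            /* single restore here */
	printf("restore deferred to exit path\n");
	return 0;
}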

This whole project may take a while.  The code in there is a
poorly-documented mess, even after Hans' cleanups.  (It's a lot worse
without them, though.)

--Andy
