linux-kernel - Re: x86 memcpy performance


Message-ID: <CAHSGOuvJsJr3r+RwM0+733BQJO6hsK0iP0ZNzRoycWX3YHwE0A@mail.gmail.com>
Date:	Mon, 5 Dec 2011 18:24:29 +0530
From:	melwyn lobo <linux.melwyn@...il.com>
To:	Maarten Lankhorst <m.b.lankhorst@...il.com>
Cc:	Borislav Petkov <bp@...64.org>,
	"Valdis.Kletnieks@...edu" <Valdis.Kletnieks@...edu>,
	Borislav Petkov <bp@...en8.de>, Ingo Molnar <mingo@...e.hu>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: x86 memcpy performance

Will AVX work on Intel Atom? I guess not. In that case, isn't it time
to have architecture-dependent definitions for basic CPU-intensive
tasks?
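
To illustrate what per-CPU selection could look like, here is a minimal
user-space sketch, assuming GCC or Clang on x86. The names copy_fn,
copy_generic, copy_avx and select_copy are hypothetical, and both copy
routines simply call memcpy as placeholders. It shows only the dispatch
idea, not how the kernel selects its routines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature shared by every copy routine in this sketch. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Fallback that works on any CPU. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
	assert(dst != NULL && src != NULL);
	return memcpy(dst, src, n);
}

/* Placeholder: a tuned AVX routine would replace this body. */
static void *copy_avx(void *dst, const void *src, size_t n)
{
	assert(dst != NULL && src != NULL);
	return memcpy(dst, src, n);
}

/* Choose a routine once, based on what the running CPU reports. */
static copy_fn select_copy(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx"))
		return copy_avx;
#endif
	return copy_generic;
}
```

A caller would run select_copy() once at startup and keep the returned
pointer, so a CPU without AVX never reaches the AVX path.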


On Thu, Sep 1, 2011 at 8:45 PM, Maarten Lankhorst
<m.b.lankhorst@...il.com> wrote:
> Hey,
>
> 2011/8/16 Borislav Petkov <bp@...64.org>:
>> On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@...edu wrote:
>>> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>>>
>>> > Benchmarking with 10000 iterations, average results:
>>> > size    XM              MM              speedup
>>> > 119     540.58          449.491         0.8314969419
>>>
>>> > 12273   2307.86         4042.88         1.751787902
>>> > 13924   2431.8          4224.48         1.737184756
>>> > 14335   2469.4          4218.82         1.708440514
>>> > 15018   2675.67         1904.07         0.711622886
>>> > 16374   2989.75         5296.26         1.771470902
>>> > 24564   4262.15         7696.86         1.805863077
>>> > 27852   4362.53         3347.72         0.7673805572
>>> > 28672   5122.8          7113.14         1.388524413
>>> > 30033   4874.62         8740.04         1.792967931
>>>
>>> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
>>> really good about this till we understand what happened for those two cases.
>>
>> Yep.
>>
>>> Also, anytime I see "10000 iterations", I ask myself if the benchmark
>>> rigging took proper note of hot/cold cache issues. That *may* explain
>>> the two oddball results we see above - but not knowing more about how
>>> it was benched, it's hard to say.
>>
>> Yeah, the more scrutiny this gets the better. So I've cleaned up my
>> setup and have attached it.
>>
>> xm_mem.c does the benchmarking and in bench_memcpy() there's the
>> sse_memcpy call which is the SSE memcpy implementation using inline asm.
>> It looks like gcc produces pretty crappy code here because if I replace
>> the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
>> same function but in pure asm - I get much better numbers, sometimes
>> even over 2x. It all depends on the alignment of the buffers though.
>> Also, those numbers don't include the context saving/restoring which the
>> kernel does for us.
>>
>> 7491    1509.89         2346.94         1.554378381
>> 8170    2166.81         2857.78         1.318890326
>> 12277   2659.03         4179.31         1.571744176
>> 13907   2571.24         4125.7          1.604558427
>> 14319   2638.74         5799.67         2.19789466      <----
>> 14993   2752.42         4413.85         1.603625603
>> 16371   3479.11         5562.65         1.59887055
>
> This work intrigued me. In some cases kernel memcpy was a lot faster than sse memcpy,
> and I finally figured out why. I also extended the test to an optimized avx memcpy,
> but I think the kernel memcpy will always win in the aligned case.
>
> The numbers you posted don't seem right. They depend a lot on alignment:
> for example, if src and dst sit at the same offset from a 64-byte boundary,
> kernel memcpy will beat avx memcpy on my machine.
>
> I replaced the malloc calls with memalign(65536, size + 256) so I could toy
> around with the alignments a little. This explains why for some sizes, kernel
> memcpy was faster than sse memcpy in the test results you had.
> When ((src & 63) == (dst & 63)), it seems that kernel memcpy always wins, otherwise
> avx memcpy might.
>
> If you want to speed up memcpy, I think your best bet is to find out why it's
> so much slower when src and dst aren't 64-byte aligned compared to each other.
>
> Cheers,
> Maarten
>
> ---
> Attached: my modified version of the sse memcpy you posted.
>
> I changed it a bit, and used avx, but some of the other changes might
> be better for your sse memcpy too.
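
The alignment condition in the quoted message can be checked with a small
helper. This is a sketch under assumptions: same_line_offset and
alloc_at_offset are hypothetical names, and posix_memalign stands in for
the memalign(65536, size + 256) call mentioned above. Note that in C the
comparison needs explicit parentheses, because == binds tighter than &.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64u

/* True when src and dst sit at the same offset within a 64-byte line. */
static int same_line_offset(const void *dst, const void *src)
{
	return ((uintptr_t)src & (CACHE_LINE - 1)) ==
	       ((uintptr_t)dst & (CACHE_LINE - 1));
}

/*
 * Allocate size + 256 bytes aligned to 64 KiB and return a pointer that
 * is `offset` bytes past the aligned base. The caller frees *base.
 */
static unsigned char *alloc_at_offset(size_t size, size_t offset, void **base)
{
	assert(offset < 256);
	if (posix_memalign(base, 65536, size + 256) != 0)
		return NULL;
	return (unsigned char *)*base + offset;
}
```

Allocating src at offset 3 and dst at offset 67 gives buffers with the
same offset within a cache line, while shifting either one by a single
byte gives the mismatched case, so both can be benchmarked side by side.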