linux-kernel - Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's


Message-ID: <1382384645.3284.86.camel@edumazet-glaptop.roam.corp.google.com>
Date:	Mon, 21 Oct 2013 12:44:05 -0700
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Neil Horman <nhorman@...driver.com>
Cc:	Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
	sebastien.dugue@...l.net, Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:

> 
> Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> such that no interrupts (save for local ones), would occur on that cpu.  Note
> that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> anywhere.

This csum_partial_opt() was a private implementation of csum_partial()
so that I could load the module without rebooting the kernel ;)

> 
> base results:
> 53569916
> 43506025
> 43476542
> 44048436
> 45048042
> 48550429
> 53925556
> 53927374
> 53489708
> 53003915
> 
> AVG = 492 ns
> 
> prefetching only:
> 53279213
> 45518140
> 49585388
> 53176179
> 44071822
> 43588822
> 44086546
> 47507065
> 53646812
> 54469118
> 
> AVG = 488 ns
> 
> 
> parallel alu's only:
> 46226844
> 44458101
> 46803498
> 45060002
> 46187624
> 37542946
> 45632866
> 46275249
> 45031141
> 46281204
> 
> AVG = 449 ns
> 
> 
> both optimizations:
> 45708837
> 45631124
> 45697135
> 45647011
> 45036679
> 39418544
> 44481577
> 46820868
> 44496471
> 35523928
> 
> AVG = 438 ns
> 
> 
> We continue to see a small savings in execution time with prefetching (4 ns, or
> about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> the best savings with both optimizations (54 ns, or 10.9%).  
> 
> These results, while they've changed as we've modified the test case slightly,
> have remained consistent in their speedup ordering.  Prefetching helps, but
> not as much as using multiple alu's, and neither is as good as doing both
> together.
> 
> Unless you see something else that I'm doing wrong here, it seems like a win to
> do both.
> 
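
The quoted averages and percentages can be re-derived from the raw totals. The following is a minimal sketch, not code from the thread: the names `summarize`, `saving_pct` and `ITERATIONS` are hypothetical, and the division by 100000 is an assumption inferred from the fact that it reproduces the quoted "AVG" figures (the benchmark's loop count is not stated in this message).

```c
#include <assert.h>
#include <stddef.h>

/* Raw totals copied from the quoted results, one entry per run. */
const double base_runs[10] = {
    53569916, 43506025, 43476542, 44048436, 45048042,
    48550429, 53925556, 53927374, 53489708, 53003915 };
const double prefetch_runs[10] = {
    53279213, 45518140, 49585388, 53176179, 44071822,
    43588822, 44086546, 47507065, 53646812, 54469118 };
const double parallel_runs[10] = {
    46226844, 44458101, 46803498, 45060002, 46187624,
    37542946, 45632866, 46275249, 45031141, 46281204 };
const double both_runs[10] = {
    45708837, 45631124, 45697135, 45647011, 45036679,
    39418544, 44481577, 46820868, 44496471, 35523928 };

/* ASSUMPTION: each raw total covers 100000 checksum calls. This is
 * inferred from the quoted averages, not stated in the message. */
#define ITERATIONS 100000.0

struct run_stats {
    double avg_ns;  /* mean time per call over all runs */
    double min_ns;  /* fastest run, per call */
    double max_ns;  /* slowest run, per call */
};

/* Mean, minimum and maximum per-call time for one set of runs. */
struct run_stats summarize(const double *runs, size_t n)
{
    struct run_stats s;
    double sum = 0.0, lo, hi;
    size_t i;

    assert(n > 0);
    lo = hi = runs[0];
    for (i = 0; i < n; i++) {
        sum += runs[i];
        if (runs[i] < lo)
            lo = runs[i];
        if (runs[i] > hi)
            hi = runs[i];
    }
    s.avg_ns = sum / (double)n / ITERATIONS;
    s.min_ns = lo / ITERATIONS;
    s.max_ns = hi / ITERATIONS;
    return s;
}

/* Relative saving of a variant against the base average, in percent. */
double saving_pct(double base_avg, double other_avg)
{
    return 100.0 * (base_avg - other_avg) / base_avg;
}
```

Under that assumption, `summarize(base_runs, 10)` gives an average of about 492 ns, with individual runs ranging from roughly 434 ns to 539 ns per call. The run-to-run spread within the base case is therefore much larger than the 4 ns attributed to prefetching.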

Well, I only said (or maybe I forgot to say) that on my machines I got no
improvement at all from either the multiple ALUs or the prefetch (I tried
different strides).

The results showed only noise.

It seems to depend on the CPU and/or other factors.
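
For readers unfamiliar with the two optimizations being compared, here is a rough portable C sketch of the idea. It is not the patch under discussion and not the kernel's csum_partial(): the function names and the `PREFETCH_AHEAD` distance are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin. The sum is split across two independent accumulators so that the additions do not form a single dependency chain, and a prefetch is issued once per 64-byte block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical prefetch distance in bytes; the thread only says that
 * "different strides" were tried. */
#define PREFETCH_AHEAD 320

/* Fold a 64-bit ones' complement accumulator down to 16 bits. */
uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Reference: a single accumulator over 16-bit words in memory order.
 * An odd trailing byte is treated as a zero-padded 16-bit word. */
uint16_t csum_ref(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum = 0;
    uint16_t h;

    for (; len >= 2; p += 2, len -= 2) {
        memcpy(&h, p, 2);
        sum += h;
    }
    if (len) {
        h = 0;
        memcpy(&h, p, 1);
        sum += h;
    }
    return fold64(sum);
}

/* Two independent accumulators plus a software prefetch per 64-byte
 * block. Intended for buffers far smaller than 2^32 bytes, so the
 * 64-bit accumulators cannot overflow. */
uint16_t csum_two_lanes(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t a = 0, b = 0;
    uint32_t w0, w1;
    uint16_t h;
    size_t i;

    assert(buf != NULL || len == 0);
    while (len >= 64) {
        /* Only prefetch addresses that lie inside the buffer. */
        if (len > PREFETCH_AHEAD)
            __builtin_prefetch(p + PREFETCH_AHEAD);
        for (i = 0; i < 64; i += 8) {
            memcpy(&w0, p + i, 4);
            memcpy(&w1, p + i + 4, 4);
            a += w0;  /* lane a and lane b do not depend on each other */
            b += w1;
        }
        p += 64;
        len -= 64;
    }
    for (; len >= 2; p += 2, len -= 2) {
        memcpy(&h, p, 2);
        a += h;
    }
    if (len) {
        h = 0;
        memcpy(&h, p, 1);
        a += h;
    }
    return fold64(a + b);
}
```

Both functions return the same folded sum for the same buffer, so either can be dropped into a timing loop. Whether the two-lane variant is actually faster depends on the CPU and the compiler, which is the point of disagreement in this thread.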

The last machine I used for the tests had:

processor	: 23
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping	: 2
microcode	: 0x13
cpu MHz		: 2800.256
cache size	: 12288 KB
physical id	: 1
siblings	: 12
core id		: 10
cpu cores	: 6


