[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1382384645.3284.86.camel@edumazet-glaptop.roam.corp.google.com>
Date: Mon, 21 Oct 2013 12:44:05 -0700
From: Eric Dumazet <eric.dumazet@...il.com>
To: Neil Horman <nhorman@...driver.com>
Cc: Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
sebastien.dugue@...l.net, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
>
> Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> such that no interrupts (save for local ones), would occur on that cpu. Note
> that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> anywhere.
This csum_partial_opt() was a private implementation of csum_partial()
so that I could load the module without rebooting the kernel ;)
>
> base results:
> 53569916
> 43506025
> 43476542
> 44048436
> 45048042
> 48550429
> 53925556
> 53927374
> 53489708
> 53003915
>
> AVG = 492 ns
>
> prefetching only:
> 53279213
> 45518140
> 49585388
> 53176179
> 44071822
> 43588822
> 44086546
> 47507065
> 53646812
> 54469118
>
> AVG = 488 ns
>
>
> parallel alu's only:
> 46226844
> 44458101
> 46803498
> 45060002
> 46187624
> 37542946
> 45632866
> 46275249
> 45031141
> 46281204
>
> AVG = 449 ns
>
>
> both optimizations:
> 45708837
> 45631124
> 45697135
> 45647011
> 45036679
> 39418544
> 44481577
> 46820868
> 44496471
> 35523928
>
> AVG = 438 ns
>
>
> We continue to see a small savings in execution time with prefetching (4 ns, or
> about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> the best savings with both optimizations (54 ns, or 10.9%).
>
> These results, while they've changed as we've modified the test case slightly
> have remained consistent in their sppedup ordinality. Prefetching helps, but
> not as much as using multiple alu's, and neither is as good as doing both
> together.
>
> Unless you see something else that I'm doing wrong here. It seems like a win to
> do both.
>
Well, I only said (or maybe I forgot), that on my machines, I got no
improvements at all with the multiple alu or the prefetch. (I tried
different strides)
Only noises in the results.
It seems it depends on cpus and/or multiple factors.
Last machine I used for the tests had :
processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
stepping : 2
microcode : 0x13
cpu MHz : 2800.256
cache size : 12288 KB
physical id : 1
siblings : 12
core id : 10
cpu cores : 6
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists