linux-kernel - Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <b6c65074-a35e-4c50-8e97-096e48e0c506@email.android.com>
Date:	Fri, 18 Oct 2013 10:09:54 -0700
From:	"H. Peter Anvin" <hpa@...or.com>
To:	Neil Horman <nhorman@...driver.com>
CC:	linux-kernel@...r.kernel.org, sebastien.dugue@...l.net,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, x86@...nel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence.

Neil Horman <nhorman@...driver.com> wrote:
>On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
>> On 10/11/2013 09:51 AM, Neil Horman wrote:
>> > Sébastien Dugué reported to me that devices implementing ipoib
>(which don't have
>> > checksum offload hardware were spending a significant amount of
>time computing
>> > checksums.  We found that by splitting the checksum computation
>into two
>> > separate streams, each skipping successive elements of the buffer
>being summed,
>> > we could parallelize the checksum operation accros multiple alus. 
>Since neither
>> > chain is dependent on the result of the other, we get a speedup in
>execution (on
>> > hardware that has multiple alu's available, which is almost
>ubiquitous on x86),
>> > and only a negligible decrease on hardware that has only a single
>alu (an extra
>> > addition is introduced).  Since addition in commutative, the result
>is the same,
>> > only faster
>> 
>> On hardware that implement ADCX/ADOX then you should also be able to
>> have additional streams interleaved since those instructions allow
>for
>> dual carry chains.
>> 
>> 	-hpa
>> 
>I've been looking into this a bit more, and I'm a bit confused. 
>According to
>this:
>http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
>
>by my read, this pair of instructions simply supports 2 carry bit
>chains,
>allowing for two parallel execution paths through the cpu that won't
>block on
>one another.  Its exactly the same as whats being done with the
>universally
>available addcq instruction, so theres no real speedup (that I can
>see).  Since
>we'd either have to use the alternatives macro to support adcx/adox
>here or the
>old instruction set, it seems not overly worth the effort to support
>the
>extension.  
>
>Or am I missing something?
>
>Neil

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/