netdev - RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Date:	Mon, 4 Nov 2013 09:47:35 -0000
From:	"David Laight" <David.Laight@...LAB.COM>
To:	"Neil Horman" <nhorman@...driver.com>
Cc:	"Ben Hutchings" <bhutchings@...arflare.com>,
	"Doug Ledford" <dledford@...hat.com>,
	"Ingo Molnar" <mingo@...nel.org>,
	"Eric Dumazet" <eric.dumazet@...il.com>,
	<linux-kernel@...r.kernel.org>, <netdev@...r.kernel.org>
Subject: RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

> > I think you need 3 instructions, move a 0, conditionally move a 1
> > then add. I suspect it won't be a win!

Or, with an appropriately unrolled loop, for each word:
	zero %eax, cmove a 1 to %al
	cmove a 1 to %ah
	shift %eax left, cmove a 1 to %al
	cmove a 1 to %ah, add %eax onto somewhere.
However the 2nd instruction stream would have to use a different
register (IIRC 8bit updates depend on the entire register).

> I agree, that sounds interesting, but very cpu dependent.  Thanks for the
> suggestion, Ben, but I think it would be better if we just did the prefetch here
> and re-addressed this area when AVX (or addcx/addox) instructions were available
> for testing on hardware.

I didn't look too closely at the original figures.
With a simple loop you need 4 instructions per iteration (load, adc, inc, branch).
How close to one iteration per clock do you get?
I thought x86 hardware prefetch would load the cache lines for sequential
accesses - so any prefetch instructions are rather pointless.
However reading the value in the previous loop iteration should help.

I've just realised that there is a problem with the loop termination
condition also needing the flags register:-(
I don't remember the 'loop' instruction ever being added to any of the
fast path instruction decodes - so it won't help.

So I suspect the best you'll get is an interleaved sequence of load and adc
with an lea and inc (both to adjust the index) and a bne back to the top.
(the lea wants to be in the middle somewhere).
That might manage 1 clock per word + 1 clock per loop iteration (if the inc
and bne can be 'fused').

	David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html