linux-kernel - Re: [PATCH v2] crypto: crc32c-pclmul

Date:	28 May 2014 19:55:16 -0400
From:	"George Spelvin" <linux@...izon.com>
To:	linux@...izon.com, tim.c.chen@...ux.intel.com
Cc:	david.m.cote@...el.com, herbert@...dor.apana.org.au,
	james.guilford@...el.com, JBeulich@...e.com,
	linux-kernel@...r.kernel.org, sandyw@...tter.com,
	wajdi.k.feghali@...el.com
Subject: Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words

> Can you do a tcrypt speed measurement with and without your changes?
> Check to see if there's any slowdown.  Please make sure you pin
> the frequency of your cpu when running the test.  

Sure thing; I was already inspired to do that based on your concerns.
Do you have any particular buffer sizes or alignments you'd suggest?

Since I'm changing only the three-part core, I was going to
avoid unaligned or short buffers, stick with a single buffer so
it stays in L1 D-cache, but vary the length so we use lots of
the K_table.

It's not the RAM I was worried about, but the D-cache wasted on
the K table.  That doesn't affect the CRC code itself, but it does
affect the surrounding kernel code.

I'm also thinking of some ideas for handling even larger buffer sizes
without having to interrupt the 3-way main loop.  Pclmulqdq can
multiply up to 4 32-bit values to produce a 128-bit result, which
crc32 can efficiently reduce.  So if we have three tables, of
x^(64*n), x^(4096*n), and x^(262144*n), each for n=0..63, we can
multiply one entry from each together to handle up to a 16 MiB chunk.
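
A minimal sketch of the three-table idea, in portable C rather than with
pclmulqdq.  Everything here is an assumption made for illustration: the
names (gf2_mulmod, gf2_xpow, init_tables, shift_constant) are hypothetical,
the polynomial is handled in plain non-reflected form, the exponent is
treated as a plain power of x, and each product is reduced immediately
instead of forming one wide product and reducing it with crc32.  It only
shows that one entry from each table multiplies out to the right power of x.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of the CRC32C polynomial, non-reflected form (assumption:
 * the real code works on bit-reflected values). */
#define CRC32C_POLY 0x1EDC6F41u

/* a(x) * b(x) mod P(x) over GF(2), one bit of b per step (Horner). */
static uint32_t gf2_mulmod(uint32_t a, uint32_t b)
{
	uint32_t r = 0;

	for (int i = 31; i >= 0; i--) {
		uint32_t carry = r & 0x80000000u;

		r <<= 1;			/* r = r * x */
		if (carry)
			r ^= CRC32C_POLY;	/* ... mod P */
		if ((b >> i) & 1)
			r ^= a;
	}
	return r;
}

/* x^e mod P(x) by square-and-multiply. */
static uint32_t gf2_xpow(uint64_t e)
{
	uint32_t r = 1, base = 2;	/* the polynomials 1 and x */

	while (e) {
		if (e & 1)
			r = gf2_mulmod(r, base);
		base = gf2_mulmod(base, base);
		e >>= 1;
	}
	return r;
}

/* tab[0][n] = x^(64*n), tab[1][n] = x^(4096*n), tab[2][n] = x^(262144*n) */
static uint32_t tab[3][64];

static void init_tables(void)
{
	for (uint64_t n = 0; n < 64; n++) {
		tab[0][n] = gf2_xpow(64 * n);
		tab[1][n] = gf2_xpow(4096 * n);
		tab[2][n] = gf2_xpow(262144 * n);
	}
}

/* x^len mod P(x) for len a multiple of 64 and below 2^24, built from
 * one entry of each table. */
static uint32_t shift_constant(uint32_t len)
{
	assert(len % 64 == 0 && len < (1u << 24));

	uint32_t lo  = tab[0][(len >> 6) & 63];
	uint32_t mid = tab[1][(len >> 12) & 63];
	uint32_t hi  = tab[2][(len >> 18) & 63];

	return gf2_mulmod(gf2_mulmod(lo, mid), hi);
}
```

Three tables of 64 32-bit entries come to 768 bytes, and the three 6-bit
indices together cover a 24-bit length, which is where the 16 MiB figure
comes from.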

The other option is to schedule the pclmulqdq in parallel with
the crc32q iterations and, after arranging a staggered start,
have a 4-part main loop, where 3 parts are performing crc32q
iterations and the fourth is using SSE to shift itself
forward (at which point it gets XORed into the data stream
that one other part is working on).
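
The algebra behind "shift forward, then XOR into another part's data
stream" can be sketched as follows.  This is only a check of the identity
the merge relies on, not the staggered scheduling or the SSE code; it
assumes a bare MSB-first CRC with no initial value, no final XOR and no
bit reflection, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRC32C_POLY 0x1EDC6F41u	/* non-reflected form (assumption) */

/* a(x) * b(x) mod P(x) over GF(2). */
static uint32_t gf2_mulmod(uint32_t a, uint32_t b)
{
	uint32_t r = 0;

	for (int i = 31; i >= 0; i--) {
		uint32_t carry = r & 0x80000000u;

		r <<= 1;
		if (carry)
			r ^= CRC32C_POLY;
		if ((b >> i) & 1)
			r ^= a;
	}
	return r;
}

/* x^e mod P(x). */
static uint32_t gf2_xpow(uint64_t e)
{
	uint32_t r = 1, base = 2;

	while (e) {
		if (e & 1)
			r = gf2_mulmod(r, base);
		base = gf2_mulmod(base, base);
		e >>= 1;
	}
	return r;
}

/* Bare bitwise CRC: returns (crc * x^(8n) + M(x) * x^32) mod P(x). */
static uint32_t crc_raw(uint32_t crc, const uint8_t *p, size_t n)
{
	while (n--) {
		crc ^= (uint32_t)*p++ << 24;
		for (int k = 0; k < 8; k++)
			crc = (crc & 0x80000000u) ? (crc << 1) ^ CRC32C_POLY
						  : crc << 1;
	}
	return crc;
}

/* CRC of a||b||c where part a's CRC is shifted forward across b and
 * XORed into the first four bytes of c before c is processed. */
static uint32_t crc_merged(const uint8_t *a, size_t na,
			   const uint8_t *b, size_t nb,
			   const uint8_t *c, size_t nc)
{
	uint8_t tmp[256];

	assert(nc >= 4 && nc <= sizeof(tmp));

	uint32_t ca = crc_raw(0, a, na);
	uint32_t cb = crc_raw(0, b, nb);

	/* Shift part a's CRC forward over the nb bytes of part b. */
	uint32_t s = gf2_mulmod(ca, gf2_xpow(8 * (uint64_t)nb));

	/* XOR it into the head of part c's data stream. */
	memcpy(tmp, c, nc);
	tmp[0] ^= (uint8_t)(s >> 24);
	tmp[1] ^= (uint8_t)(s >> 16);
	tmp[2] ^= (uint8_t)(s >> 8);
	tmp[3] ^= (uint8_t)s;

	/* Part b still has to be shifted across c and folded in. */
	return crc_raw(0, tmp, nc) ^ gf2_mulmod(cb, gf2_xpow(8 * (uint64_t)nc));
}
```

In this sketch the stream that absorbs the shifted value needs no separate
combine step for part a afterwards, which is the property a 4-part loop
would depend on.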

I haven't got all the details of that idea worked out in my head, but
it seems possible.  I have to study the optimization guide in detail to
see how many micro-ops the crc32q instruction from memory is (and thus
how much of the decoder it requires).

As of Nehalem, a small inner loop that fits in the decoded uop cache
has the potential to be faster than a hugely unrolled one.