linux-kernel - RE: [PATCH 3/3] crypto: x86/crc32c - eliminate jump table and excessive unrolling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <00c9c7c84e9043689942fc1f36e28591@AcuMS.aculab.com>
Date: Mon, 14 Oct 2024 22:32:48 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Eric Biggers' <ebiggers@...nel.org>
CC: "linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
	"x86@...nel.org" <x86@...nel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, Ard Biesheuvel <ardb@...nel.org>, "Josh
 Poimboeuf" <jpoimboe@...nel.org>, Peter Zijlstra <peterz@...radead.org>
Subject: RE: [PATCH 3/3] crypto: x86/crc32c - eliminate jump table and
 excessive unrolling

...
> > Do you need to unroll it at all?

> It looks like on most CPUs, no.  On Haswell, Emerald Rapids, Zen 2 it does not
> make a significant difference.  However, it helps on Zen 5.

I wonder if one of the loop instructions is using the ALU
unit you really want to be processing a crc32?
If the cpu has fused arithmetic+jump u-ops then trying to get the
decoder to use one of those may help.

Is Zen 5 actually slower than the other systems?
I've managed to get clock cycle counts using the performance counters
that more of less match the predicted values.
You can't use 'rdtsc' because the cpu frequence isn't stable.

...
> > If you are really lucky you'll get two memory reads/clock.
> > So you won't ever to do than two crc32/clock.
> > Looking at Agner's instruction latency tables I don't think
> > any cpu can do more that one per clock, or pipeline them.
> > I think that means you don't even need two (never mind 3)
> > buffers.
> 
> On most Intel and AMD CPUs (I tested Haswell for old Intel, Emerald Rapids for
> new Intel, and Zen 2 for slightly-old AMD), crc32q has 3 cycle latency and 1 per
> cycle throughput.  So you do need at least 3 streams.

Bah, I missed the latency column :-)

> AMD Zen 5 has much higher crc32q throughput and seems to want up to 7 streams.
> This is not implemented yet.

The copy of the tables I have is old - doesn't contain Zen-5.
Does that mean that 2 (or more) of its alu 'units' can do crc32
so you can do more than 1/clock (along with the memory reads).

One thought is how much of it is actually worth while!
If the data isn't already in the L1 data cache then the cache
loads almost certainly dominate - especially if you have to
do out to 'real memory'.
You can benchmark the loops by repeatedly accessing the same
data - but that isn't what will actually happen.

> > Most modern x86 can do 4 or 5 (or even more) ALU operations
> > per clock - depending on the combination of instructions.
> >
> > Replace the loop termination with a comparison of 'bufp'
> > against a pre-calculated limit and you get two instructions
> > (that might get merged into one u-op) for the loop overhead.
> > They'll run in parallel with the crc32q instructions.
> 
> That's actually still three instructions: add, cmp, and jne.

I was really thinking of the loop I quoted later.
The one that uses negative offsets from the end of the buffer.
That has an 'add' and a 'jnz' - which might even fuse into a
single u-op.
Maybe even constrained to p6 - so won't go near p1.
(I don't have a recent AMD cpu)

It may not actually matter.
The add/subtract/cmp are only dependant on themselves.
Similarly the jne is only dependant on the result of the sub/cmp.
In principle they can all run in the same clock (for different
loop cycles) since the rest of the loop only needs one of the
ALU blocks (on Intel only P1 can do crc).
But I failed to get a 1 clock loop (using ADC - which doesn't
have a latency issue).
It might be impossible because a predicted-taken conditional jmp
has a latency of 2.

> I tried it on both Intel and AMD, and it did not help.
> 
> > I've never managed to get a 1-clock loop, but two is easy.
> > You might find that just:
> > 10:
> > 	crc32q	(bufp), crc
> > 	crc32q	8(bufp), crc
> > 	add		$16, bufp
> > 	cmp		bufp, buf_lim
> > 	jne		10b
> > will run at 8 bytes/clock on modern intel cpu.
> 
> No, the latency of crc32q is still three cycles.  You need three streams.

If you need three streams to get one crc32/clock then, in theory,
you can get two more simple ALU ops, at least one memory read and
a jump in every clock - even on Sandy bridge.
So they are unlikely to dominate the loop whatever you do.

If the loop is too long you can get a stall (probably) because a register
has to be read back from the real register file and not just forwarded
from a previous use/alu result.
I've gained a clock back by adding an extra instruction in the middle
of a loop!
But the not-unrolled (multi-stream) loop isn't long enough for that
to be an issue.

Enough rambling.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)