Message-ID: <CAGG=3QUnUQL2=YxN2ozwSba2A_x-S7sAEUP5oGhCWOzu4Q9SQA@mail.gmail.com>
Date: Mon, 3 Mar 2025 16:16:43 -0800
From: Bill Wendling <morbo@...gle.com>
To: "H. Peter Anvin" <hpa@...or.com>
Cc: David Laight <david.laight.linux@...il.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" <x86@...nel.org>, Eric Biggers <ebiggers@...nel.org>, Ard Biesheuvel <ardb@...nel.org>,
Nathan Chancellor <nathan@...nel.org>, Nick Desaulniers <nick.desaulniers+lkml@...il.com>,
Justin Stitt <justinstitt@...gle.com>, LKML <linux-kernel@...r.kernel.org>,
linux-crypto@...r.kernel.org, clang-built-linux <llvm@...ts.linux.dev>
Subject: Re: [PATCH v2] x86/crc32: use builtins to improve code generation
On Mon, Mar 3, 2025 at 3:58 PM H. Peter Anvin <hpa@...or.com> wrote:
> On March 3, 2025 2:42:16 PM PST, David Laight <david.laight.linux@...il.com> wrote:
> >On Mon, 3 Mar 2025 12:27:21 -0800
> >Bill Wendling <morbo@...gle.com> wrote:
> >
> >> On Mon, Mar 3, 2025 at 12:15 PM David Laight
> >> <david.laight.linux@...il.com> wrote:
> >> > On Thu, 27 Feb 2025 15:47:03 -0800
> >> > Bill Wendling <morbo@...gle.com> wrote:
> >> >
> >> > > For both gcc and clang, crc32 builtins generate better code than the
> >> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> >> > > does the same and unrolls the loops. GCC has no changes on i386, but
> >> > > Clang's code generation is vastly improved, due to Clang's "rm"
> >> > > constraint issue.
> >> > >
> >> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> >> > > is expected because of the "rm" issue. However, Clang's performance is
> >> > > better than GCC's by ~1.5%, most likely due to loop unrolling.
> >> >
> >> > How much does it unroll?
> >> > How much you need depends on the latency of the crc32 instruction.
> >> > The copy of Agner's tables I have gives it a latency of 3 on
> >> > pretty much everything.
> >> > If you can only do one chained crc instruction every three clocks
> >> > it is hard to see how unrolling the loop will help.
> >> > Intel cpu (since sandy bridge) will run a two clock loop.
> >> > With three clocks to play with it should be easy (even for a compiler)
> >> > to generate a loop with no extra clock stalls.
> >> >
> >> > Clearly if Clang decides to copy arguments to the stack an extra time
> >> > that will kill things. But in this case you want the "m" constraint
> >> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
> >> >
> >> Below is what Clang generates with the builtins. From what Eric said,
> >> this code is only run for sizes <= 512 bytes? So maybe it's not super
> >> important to micro-optimize this. I apologize, but my ability to
> >> measure clock loops for x86 code isn't great. (I'm sure I lack the
> >> requisite benchmarks, etc.)
> >
> >Jeepers - that is trashing the I-cache.
> >Not to mention all the conditional branches at the bottom.
> >Consider the basic loop:
> >1: crc32q (%rcx), %rbx
> > addq $8, %rcx
> > cmp %rcx, %rdx
> > jne 1b
> >The crc32 has latency 3 so it must take at least 3 clocks.
> >Even naively the addq can be issued in the same clock as the crc32
> >and the cmp and jne in the following ones.
> >Since the jne is predicted taken, the addq can be assumed to execute
> >in the same clock as the jne.
> >(The cmp+jne might also get merged into a single u-op)
> >(I've done this with adc (for IP checksum), with two adc the loop takes
> >two clocks even with the extra memory reads.)
> >
> >So that loop is likely to run limited by the three clock latency of crc32.
> >Even the memory reads will happen with all the crc32 just waiting for the
> >previous crc32 to finish.
> >You can take an instruction out of the loop:
> >1: crc32q (%rcx,%rdx), %rbx
> > addq $8, %rdx
> > jne 1b
> >but that may not be necessary, and (IIRC) gcc doesn't like letting you
> >generate it.
> >
> >For buffers that aren't multiples of 8 bytes 'remember' that the crc of
> >a byte depends on how far it is from the end of the buffer, and that initial
> >zero bytes have no effect.
> >So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
> >right by the number of bytes needed to make the rest of the buffer a multiple
> >of 8 bytes (the same as reading from across the start of the buffer and masking
> >the low bytes) then treat exactly the same as a buffer that is a multiple
> >of 8 bytes long.
> >Don't worry about misaligned reads, you lose less than one clock per cache
> >line (that is with adc doing a read every clock).
> >
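To make the head-padding idea above concrete, here is a rough C sketch. It is
not the kernel's code: crc32c_byte() and crc32c_word() are hypothetical
software stand-ins for the crc32b and crc32q instructions, and
crc32c_padded_head() is a made-up name. It assumes a little-endian host, a
buffer of at least 8 bytes, and an incoming CRC of zero (a non-zero seed would
be XORed into the padding bytes, so it needs extra handling that this sketch
does not attempt). On a little-endian load, moving the leading bytes behind the
zero padding is a left shift of the loaded word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC32C update for one byte; software stand-in for crc32b. */
static uint32_t crc32c_byte(uint32_t crc, uint8_t b)
{
    crc ^= b;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
}

/* Stand-in for crc32q: consumes the 8 bytes of w, lowest byte first. */
static uint32_t crc32c_word(uint32_t crc, uint64_t w)
{
    for (int i = 0; i < 8; i++)
        crc = crc32c_byte(crc, (uint8_t)(w >> (8 * i)));
    return crc;
}

/* Reference implementation: one byte at a time. */
static uint32_t crc32c_bytes(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len--)
        crc = crc32c_byte(crc, *p++);
    return crc;
}

/*
 * Head-padding variant (assumptions: seed 0, len >= 8, little-endian).
 * The first r = len % 8 bytes are placed behind 8 - r zero bytes, so the
 * remainder of the buffer is a whole number of 8-byte words and no byte
 * tail is needed.
 */
static uint32_t crc32c_padded_head(const uint8_t *p, size_t len)
{
    size_t r = len % 8;
    uint32_t crc = 0;
    uint64_t w;

    assert(len >= 8);
    if (r) {
        memcpy(&w, p, 8);       /* unaligned 8-byte load */
        w <<= 8 * (8 - r);      /* zero bytes first, then p[0..r-1] */
        crc = crc32c_word(crc, w);
        p += r;
        len -= r;
    }
    for (; len; len -= 8, p += 8) {
        memcpy(&w, p, 8);
        crc = crc32c_word(crc, w);
    }
    return crc;
}
```

Comparing crc32c_padded_head(buf, len) against crc32c_bytes(0, buf, len) for
every len from 8 upward is a cheap way to check the trick.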
For reference, GCC does much better with code gen, but only with the builtin:
.L39:
        crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
        addq    $8, %rax        #, p
        cmpq    %rcx, %rax      # _37, p
        jne     .L39            #,
        leaq    (%rsi,%rdi,8), %rsi     #, p
.L38:
        andl    $7, %edx        #, len
        je      .L41            #,
        addq    %rsi, %rdx      # p, _11
        movl    %ebx, %eax      # crc, <retval>
        .p2align 4
.L40:
        crc32b  (%rsi), %eax    # MEM[(const u8 *)p_45], <retval>
        addq    $1, %rsi        #, p
        cmpq    %rsi, %rdx      # p, _11
        jne     .L40            #,
> >Actually measuring the performance is hard.
> >You can't use rdtsc because the clock speed will change when the cpu gets busy.
> >There is a 'performance counter' that is actual clocks.
> >While you can use the library functions to set it up, you need to just read the
> >register - the library overhead is too big.
> >You also need the odd lfence.
> >Having done that, and provided the buffer is in the L1 d-cache you can measure
> >the loop time in clocks and compare against the expected value.
> >Once you've got 3 clocks per crc32 instruction it won't get any better,
> >which is why the 'fast' code for big buffers does crc of 3+ buffer sections
> >in parallel.
> >
Thanks for the info! It'll help a lot the next time I need to delve
deeply into performance.
I tried using rdtsc and another programmatic way of measuring timing.
Also tried making the task have high priority, restricting to one CPU,
etc. But the numbers weren't as consistent as I wanted them to be. The
times I reported were based on the fastest times / clocks / whatever
from several runs for each build.
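For what it's worth, the "fastest of several runs" approach looks roughly like
the sketch below. This is only an illustration, not the harness I used: now_ns(),
best_of() and count_call() are hypothetical names, and it measures wall-clock
nanoseconds via clock_gettime() rather than actual core clocks, so it does not
address the frequency-scaling problem David describes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds (not CPU cycles). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Run fn(arg) 'runs' times and return the shortest elapsed time seen. */
static uint64_t best_of(int runs, void (*fn)(void *), void *arg)
{
    uint64_t best = UINT64_MAX;

    assert(runs > 0);
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = now_ns();
        fn(arg);
        uint64_t dt = now_ns() - t0;
        if (dt < best)
            best = dt;
    }
    return best;
}

/* Placeholder workload: counts how often it was called. */
static void count_call(void *arg)
{
    ++*(int *)arg;
}
```

Taking the minimum filters out runs that were disturbed by interrupts or
migration, but it still can't tell you clocks per crc32 instruction.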
> > David
> >
> >>
> >> -bw
> >>
> >> .LBB1_9: # =>This Inner Loop Header: Depth=1
> >> movl %ebx, %ebx
> >> crc32q (%rcx), %rbx
> >> addq $8, %rcx
> >> incq %rdi
> >> cmpq %rdi, %rsi
> >> jne .LBB1_9
> >> # %bb.10:
> >> subq %rdi, %rax
> >> jmp .LBB1_11
> >> .LBB1_7:
> >> movq %r14, %rcx
> >> .LBB1_11:
> >> movq %r15, %rsi
> >> andq $-8, %rsi
> >> cmpq $7, %rdx
> >> jb .LBB1_14
> >> # %bb.12:
> >> xorl %edx, %edx
> >> .LBB1_13: # =>This Inner Loop Header: Depth=1
> >> movl %ebx, %ebx
> >> crc32q (%rcx,%rdx,8), %rbx
> >> crc32q 8(%rcx,%rdx,8), %rbx
> >> crc32q 16(%rcx,%rdx,8), %rbx
> >> crc32q 24(%rcx,%rdx,8), %rbx
> >> crc32q 32(%rcx,%rdx,8), %rbx
> >> crc32q 40(%rcx,%rdx,8), %rbx
> >> crc32q 48(%rcx,%rdx,8), %rbx
> >> crc32q 56(%rcx,%rdx,8), %rbx
> >> addq $8, %rdx
> >> cmpq %rdx, %rax
> >> jne .LBB1_13
> >> .LBB1_14:
> >> addq %rsi, %r14
> >> .LBB1_15:
> >> andq $7, %r15
> >> je .LBB1_23
> >> # %bb.16:
> >> crc32b (%r14), %ebx
> >> cmpl $1, %r15d
> >> je .LBB1_23
> >> # %bb.17:
> >> crc32b 1(%r14), %ebx
> >> cmpl $2, %r15d
> >> je .LBB1_23
> >> # %bb.18:
> >> crc32b 2(%r14), %ebx
> >> cmpl $3, %r15d
> >> je .LBB1_23
> >> # %bb.19:
> >> crc32b 3(%r14), %ebx
> >> cmpl $4, %r15d
> >> je .LBB1_23
> >> # %bb.20:
> >> crc32b 4(%r14), %ebx
> >> cmpl $5, %r15d
> >> je .LBB1_23
> >> # %bb.21:
> >> crc32b 5(%r14), %ebx
> >> cmpl $6, %r15d
> >> je .LBB1_23
> >> # %bb.22:
> >> crc32b 6(%r14), %ebx
> >> .LBB1_23:
> >> movl %ebx, %eax
> >> .LBB1_24:
> >
> >
>
> The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?
Definitely on the weird side! I considered hard-coding something like
that, but thought it might be a bit convoluted, though certainly less
convoluted than what we generate now. A simple loop is probably all
that's needed, because it should only need to be done at most seven
times.
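To compare the two tail shapes side by side, here is a minimal C sketch. It is
an illustration only, not the patch: crc32c_step() is a hypothetical software
stand-in for the crc32b/crc32w/crc32l instructions (n = 1, 2 or 4 bytes), and
the two tail function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for crc32b/crc32w/crc32l: consumes n bytes (1, 2 or 4). */
static uint32_t crc32c_step(uint32_t crc, const uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Simple loop: at most seven one-byte steps. */
static uint32_t crc32c_tail_loop(uint32_t crc, const uint8_t *p, size_t len)
{
    assert(len < 8);
    while (len--)
        crc = crc32c_step(crc, p++, 1);
    return crc;
}

/* 4-2-1 stepdown: at most three steps, one test per bit of len. */
static uint32_t crc32c_tail_421(uint32_t crc, const uint8_t *p, size_t len)
{
    assert(len < 8);
    if (len & 4) {
        crc = crc32c_step(crc, p, 4);
        p += 4;
    }
    if (len & 2) {
        crc = crc32c_step(crc, p, 2);
        p += 2;
    }
    if (len & 1)
        crc = crc32c_step(crc, p, 1);
    return crc;
}
```

Both produce the same CRC for any tail length from 0 to 7. The stepdown caps
the dependent crc32 chain at three instructions, while the loop is shorter to
read and stays small in the I-cache.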
-bw