Message-ID: <019701cc56e8$e606f9c0$b214ed40$@systemfabricworks.com>
Date:	Tue, 9 Aug 2011 18:05:56 -0500
From:	"Bob Pearson" <rpearson@...temfabricworks.com>
To:	"'Joakim Tjernlund'" <joakim.tjernlund@...nsmode.se>,
	"George Spelvin" <linux@...izon.com>
Cc:	<akpm@...ux-foundation.org>, <fzago@...temfabricworks.com>,
	<linux@...izon.com>, <linux-kernel@...r.kernel.org>
Subject: RE: [patch v3 7/7] crc32: final-cleanup.diff



> 
> Doing this adds one insn to ppc. What is good for x86 isn't for
> ppc32 (and I guess any modern arch).

I looked at the assembly output from the original code and I see two
compares at the end of the first loop: one for --len and one for (buf & 3),
which it has to compute. You are better off computing buf & 3 once and
comparing it with len once to compute init_len; then in each loop iteration
you only have to compare one thing.

As proposed:
        init_len = min((-((uintptr_t)buf)) & 3, len);
        ...

        /* process unaligned initial bytes */
        for (i = 0; i < init_len; i++)
                DO_CRC(*buf++);


crc32x_le:
        <.....>
        negq    %rcx				# (-buf)
        andl    $3, %ecx				# (-buf) & 3
        cmpq    %rdx, %rcx				# min()
        cmova   %rdx, %rcx				# init_len = 
        xorl    %edi, %edi				# i = 0
        <.....>
        jmp     .L2
.L3:
# crc = tab[0][(crc ^ (*buf++)) & 255] ^ (crc >> 8)
        movb    (%rsi,%rdi), %dl			# buf[i]
        incq    %rdi					# buf++
        xorl    %eax, %edx				# crc ^ *buf++
        shrl    $8, %eax				# crc >> 8
        movzbl  %dl, %edx				# & 255
        xorl    crc32table_le(,%rdx,4), %eax		# crc = tab[...] ^ (crc >> 8)
.L2:
        cmpq    %rcx, %rdi				# compare i with init_len
        jb      .L3
        <.....>
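
For reference, here is roughly what that prologue looks like written back
out as plain C (a sketch only; the helper name is made up, and the
byte-at-a-time update follows the "crc = tab[0][...]" comment in the
listing above rather than the actual patch):

        #include <stdint.h>
        #include <stddef.h>

        /* Hypothetical helper: consume the 0-3 leading bytes before the
         * buffer reaches 4-byte alignment, capped at len, with init_len
         * computed once up front. */
        static uint32_t crc32_le_head(uint32_t crc, const unsigned char *buf,
                                      size_t len, const uint32_t (*tab)[256])
        {
                size_t align = (size_t)(-(uintptr_t)buf) & 3; /* bytes to 4-byte alignment */
                size_t init_len = align < len ? align : len;  /* min(align, len) */
                size_t i;

                /* one comparison per iteration: i against init_len */
                for (i = 0; i < init_len; i++)
                        crc = tab[0][(crc ^ *buf++) & 255] ^ (crc >> 8);
                return crc;
        }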

As it was originally:
        if (unlikely((long)buf & 3 && len)) {
                do {
                        DO_CRC(*buf++);
                } while ((--len) && ((long)buf)&3);
        }

crc32x_le:
        pushq   %rbp
        movl    %edi, %eax
        pushq   %rbx
        testb   $3, %sil
        je      .L16
        testq   %rdx, %rdx
.L25:
        je      .L16
        movb    (%rsi), %cl				# *p
        incq    %rsi					# p++
        xorl    %eax, %ecx				# crc ^ *p
        shrl    $8, %eax				# crc >> 8
        movzbl  %cl, %ecx				# ... & 255
        xorl    crc32table_le(,%rcx,4), %eax		# tab[...] ^ (crc >> 8)
        decq    %rdx				# len--
        je      .L16					# if len != 0 continue
        testb   $3, %sil				# buf & 3
        jmp     .L25					# if buf & 3 continue
.L16:

The first loop has 8 instructions; the second one has 11. The first loop
has slightly more setup to compute init_len; the second loop has a couple
of extra branches to handle the if() outside of the loop.

I get better performance with the new form of the loop. How does PPC get
better results with two tests?

> 
> > >
> > > -/* implements slicing-by-4 or slicing-by-8 algorithm */
> > > -static inline u32
> > > -crc32_body(u32 crc, unsigned char const *buf, size_t len, const u32
> > > (*tab)[256])
> >
> > After careful measurements and looking at the asm code I figured out
> > that there was no penalty for using a 2D array. That allowed me to go
> > back to the original form.
> 
> What about my suggestion to assign each table to a ptr? This was good
> for ppc so if it doesn't slow x86 it should be added.

GCC is very clever: it inlined crc32_body() into crc32_le() and crc32_be()
and then replaced references to tab[x][y] with references to crc32table_le
and crc32table_be plus a constant offset that depends on the first index
([x]). So in effect it completely unrolled back to 1D arrays as
(crc32table_le + offset)[y], which is equivalent to the t0_le[y] code I had
before.
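
In other words (a minimal sketch with made-up names, not the kernel
source), a constant first index just folds into a constant offset from the
table base, so these two lookups compile to the same load:

        #include <stdint.h>

        /* Illustrative stand-in for the generated table (values omitted). */
        static const uint32_t tab[4][256];

        static uint32_t lookup_2d(uint32_t byte)
        {
                return tab[1][byte & 255];            /* tab[x][y] with constant x = 1 */
        }

        static uint32_t lookup_1d(uint32_t byte)
        {
                const uint32_t *t1 = &tab[0][0] + 1 * 256;  /* base + constant offset */
                return t1[byte & 255];                /* same load as lookup_2d() */
        }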


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
