Message-ID: <1401316379.2970.387.camel@schen9-DESK>
Date: Wed, 28 May 2014 15:32:59 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: George Spelvin <linux@...izon.com>
Cc: herbert@...dor.apana.org.au, JBeulich@...e.com,
linux-kernel@...r.kernel.org, sandyw@...tter.com,
James Guilford <james.guilford@...el.com>
Subject: Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
On Wed, 2014-05-28 at 10:40 -0400, George Spelvin wrote:
> While following a number of tangents in the code (I was figuring out
> how to edit lib/Kconfig; don't ask), I came across a table of 256 64-bit
> words, all of which had the high half set to zero.
>
> Since the code depends on both pclmulq and crc32, SSE 4.1 is obviously
> present, so it could use pmovzxdq and save 1K of kernel data.
>
> The following patch obviously lacks the kludges for old binutils,
> but should convey the general idea.
>
> Jan: Is support for SLE10's pre-2.18 binutils still required?
> Your PEXTRD fix was only a year ago, so I expect it is, but I wanted to ask.
>
> Two other minor additional changes:
>
> 1. The current code unnecessarily puts the table in the read-write
> .data section. Moved to .text.
> 2. I'm also not sure why it's necessary to force such large alignment
> on K_table. Comments on reducing it?
>
> Signed-off-by: George Spelvin <linux@...izon.com>
>
>
> diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
> index dbc4339b..9f885ee4 100644
> --- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
> +++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
> @@ -216,15 +216,11 @@ LABEL crc_ %i
> ## 4) Combine three results:
> ################################################################
>
> - lea (K_table-16)(%rip), bufp # first entry is for idx 1
> + lea (K_table-8)(%rip), bufp # first entry is for idx 1
> shlq $3, %rax # rax *= 8
> - subq %rax, tmp # tmp -= rax*8
> - shlq $1, %rax
> - subq %rax, tmp # tmp -= rax*16
> - # (total tmp -= rax*24)
> - addq %rax, bufp
> -
> - movdqa (bufp), %xmm0 # 2 consts: K1:K2
> + pmovzxdq (bufp,%rax), %xmm0 # 2 consts: K1:K2
Changing from the aligned move (movdqa) to an unaligned move with zero
extension (pmovzxdq) is going to make things slower.  And if the table is
only aligned on an 8-byte boundary, some of the loads can span 2 cache
lines, which slows things down further.
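
For reference, here is roughly what the two variants do, written as a
small C sketch with SSE intrinsics (the table names, sizes and helper
functions are illustrative only; the real code is the hand-written
assembly above):

	#include <stdint.h>
	#include <immintrin.h>	/* SSE4.1 intrinsics; build with -msse4.1 */

	/* Placeholder tables; the real constants live in
	 * arch/x86/crypto/crc32c-pcl-intel-asm_64.S. */
	static const uint64_t k_wide[256] __attribute__((aligned(16)));  /* 2K, high halves zero */
	static const uint32_t k_narrow[256] __attribute__((aligned(8))); /* 1K, packed dwords */

	/* Current code: one aligned 16-byte load (movdqa) picks up K1:K2. */
	static inline __m128i load_k_wide(unsigned int idx)
	{
		return _mm_load_si128((const __m128i *)&k_wide[2 * idx]);
	}

	/* Proposed code: an 8-byte load whose two dwords are zero-extended
	 * (pmovzxdq) to the same K1:K2 layout in the xmm register. */
	static inline __m128i load_k_narrow(unsigned int idx)
	{
		return _mm_cvtepu32_epi64(
			_mm_loadl_epi64((const __m128i *)&k_narrow[2 * idx]));
	}

A compiler will typically fold the 64-bit load and the zero extension
into a single pmovzxdq with a memory operand, which is what the patch
emits directly.
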
We are trading speed for only about 1K of memory saved, which is likely
not a good trade for most systems, except those that are really memory
constrained.  Such a non-performance-critical system may as well use the
generic crc32c algorithm and compile this module out.
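
If memory really is that tight, something like the following config
fragment would do that (assuming I have the symbol names from
crypto/Kconfig right):

	# .config fragment (sketch) for a memory-constrained build:
	# drop the PCLMULQDQ-accelerated module, keep the generic C one.
	# CONFIG_CRYPTO_CRC32C_INTEL is not set
	CONFIG_CRYPTO_CRC32C=y
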
Thanks.
Tim