Message-ID: <20230822051710.GC1661@sol.localdomain>
Date: Mon, 21 Aug 2023 22:17:10 -0700
From: Eric Biggers <ebiggers@...nel.org>
To: Kamlesh Gurudasani <kamlesh@...com>
Cc: Herbert Xu <herbert@...dor.apana.org.au>,
"David S. Miller" <davem@...emloft.net>,
Rob Herring <robh+dt@...nel.org>,
Krzysztof Kozlowski <krzysztof.kozlowski+dt@...aro.org>,
Conor Dooley <conor+dt@...nel.org>, Nishanth Menon <nm@...com>,
Vignesh Raghavendra <vigneshr@...com>,
Tero Kristo <kristo@...nel.org>,
Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>,
Maxime Coquelin <mcoquelin.stm32@...il.com>,
Alexandre Torgue <alexandre.torgue@...s.st.com>,
linux-crypto@...r.kernel.org, linux-kernel@...r.kernel.org,
devicetree@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
linux-stm32@...md-mailman.stormreply.com
Subject: Re: [EXTERNAL] Re: [PATCH v2 0/6] Add support for Texas Instruments
MCRC64 engine
On Fri, Aug 18, 2023 at 02:36:34PM +0530, Kamlesh Gurudasani wrote:
> Hi Eric,
>
> We are more interested in offload than performance, with splice system
> call and DMA mode in driver(will be implemented after this series gets
> merged), good amount of cpu cycles will be saved.
So it's for power usage, then? Or freeing up CPU for other tasks?
> There is one more mode(auto mode) in mcrc64 which helps to verify crc64
> values against pre calculated crc64, saving the efforts of comparing in
> userspace.
Is there any path forward to actually support this?
>
> Current generic implementation of crc64-iso(part of this series)
> gives 173 Mb/s of speed as opposed to mcrc64 which gives speed of 812
> Mb/s when tested with tcrypt.
This doesn't answer my question, which to reiterate was:

    How does performance compare to a properly optimized software CRC
    implementation on your platform, i.e. an implementation using carryless
    multiplication instructions (e.g. ARMv8 CE) if available on your platform,
    otherwise an implementation using the slice-by-8 or slice-by-16 method?

The implementation you tested was slice-by-1. Compared to that, it's common for
slice-by-8 to speed up CRCs by about 4 times and for folding with carryless
multiplication to speed up CRCs by 10-30 times, sometimes limited only by memory
bandwidth. I don't know what specific results you would get on your specific
CPU and for this specific CRC, and you could certainly see something different
if you e.g. have some low-end embedded CPU. But those are the typical results
I've seen for other CRCs on different CPUs. So, a software implementation may
be more attractive than you realize. It could very well be the case that a
PMULL-based CRC implementation actually ends up with less CPU load than your
"hardware offload", when taking into account syscall, algif_hash, and driver
overhead...
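
To make the slice-by-1 versus slice-by-8 distinction concrete, here is a minimal sketch in C. It is not the code from the series: it assumes the reflected form of the polynomial x^64 + x^4 + x^3 + x + 1 (0xD800000000000000), and the names crc64_init_tables, crc64_slice1, and crc64_slice8 are made up for illustration. Slice-by-1 does one table lookup per input byte, each depending on the previous one. Slice-by-8 consumes eight bytes per iteration with eight independent lookups, which is where the typical speedup comes from.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: reflected form of x^64 + x^4 + x^3 + x + 1. */
#define CRC64_POLY_REFLECTED 0xD800000000000000ULL

/* table[k][b] = CRC state after byte b followed by k zero bytes. */
static uint64_t crc64_table[8][256];

static void crc64_init_tables(void)
{
	for (int i = 0; i < 256; i++) {
		uint64_t crc = (uint64_t)i;

		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^
			      ((crc & 1) ? CRC64_POLY_REFLECTED : 0);
		crc64_table[0][i] = crc;
	}
	for (int i = 0; i < 256; i++) {
		for (int k = 1; k < 8; k++) {
			uint64_t prev = crc64_table[k - 1][i];

			/* Advance the previous entry by one more zero byte. */
			crc64_table[k][i] = (prev >> 8) ^
					    crc64_table[0][prev & 0xff];
		}
	}
}

/* Slice-by-1: one dependent table lookup per input byte. */
static uint64_t crc64_slice1(uint64_t crc, const uint8_t *p, size_t len)
{
	while (len--)
		crc = (crc >> 8) ^ crc64_table[0][(crc ^ *p++) & 0xff];
	return crc;
}

/* Slice-by-8: eight independent lookups per 8 input bytes. */
static uint64_t crc64_slice8(uint64_t crc, const uint8_t *p, size_t len)
{
	while (len >= 8) {
		uint64_t w = 0;

		/* Load 8 bytes as little-endian on any host. */
		for (int i = 0; i < 8; i++)
			w |= (uint64_t)p[i] << (8 * i);
		w ^= crc;
		crc = crc64_table[7][w & 0xff] ^
		      crc64_table[6][(w >> 8) & 0xff] ^
		      crc64_table[5][(w >> 16) & 0xff] ^
		      crc64_table[4][(w >> 24) & 0xff] ^
		      crc64_table[3][(w >> 32) & 0xff] ^
		      crc64_table[2][(w >> 40) & 0xff] ^
		      crc64_table[1][(w >> 48) & 0xff] ^
		      crc64_table[0][w >> 56];
		p += 8;
		len -= 8;
	}
	/* Handle the remaining tail bytes one at a time. */
	return crc64_slice1(crc, p, len);
}
```

Both functions take and return the raw CRC state, so any initial value or final XOR that a particular CRC-64 variant requires is left to the caller. The two must return identical results for the same input; only the number of bytes consumed per loop iteration differs.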
- Eric