linux-kernel - Re: [EXTERNAL] Re: [EXTERNAL] Re: [PATCH v2 0/6] Add support for Texas Instruments MCRC64 engine

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87v8cwd4je.fsf@kamlesh.i-did-not-set--mail-host-address--so-tickle-me>
Date:   Wed, 30 Aug 2023 19:18:53 +0530
From:   Kamlesh Gurudasani <kamlesh@...com>
To:     Eric Biggers <ebiggers@...nel.org>
CC:     Herbert Xu <herbert@...dor.apana.org.au>,
        "David S. Miller" <davem@...emloft.net>,
        Rob Herring <robh+dt@...nel.org>,
        "Krzysztof Kozlowski" <krzysztof.kozlowski+dt@...aro.org>,
        Conor Dooley <conor+dt@...nel.org>, Nishanth Menon <nm@...com>,
        Vignesh Raghavendra <vigneshr@...com>,
        Tero Kristo <kristo@...nel.org>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Maxime Coquelin <mcoquelin.stm32@...il.com>,
        Alexandre Torgue <alexandre.torgue@...s.st.com>,
        <linux-crypto@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <devicetree@...r.kernel.org>,
        <linux-arm-kernel@...ts.infradead.org>,
        <linux-stm32@...md-mailman.stormreply.com>, <kamlesh@...com>
Subject: Re: [EXTERNAL] Re: [EXTERNAL] Re: [PATCH v2 0/6] Add support for
 Texas Instruments MCRC64 engine

Eric Biggers <ebiggers@...nel.org> writes:

> On Fri, Aug 18, 2023 at 02:36:34PM +0530, Kamlesh Gurudasani wrote:
>> Hi Eric,
>> 
>> We are more interested in offload than performance, with splice system
>> call and DMA mode in driver(will be implemented after this series gets
>> merged), good amount of cpu cycles will be saved.
>
> So it's for power usage, then?  Or freeing up CPU for other tasks?
>

It's for freeing up CPU for other tasks

>> There is one more mode(auto mode) in mcrc64 which helps to verify crc64
>> values against pre calculated crc64, saving the efforts of comparing in
>> userspace.
>
> Is there any path forward to actually support this?
>
>> 
>> Current generic implementation of crc64-iso(part of this series)
>> gives 173 Mb/s of speed as opposed to mcrc64 which gives speed of 812
>> Mb/s when tested with tcrypt.
>
> This doesn't answer my question, which to reiterate was:
>
>     How does performance compare to a properly optimized software CRC
>     implementation on your platform, i.e. an implementation using carryless
>     multiplication instructions (e.g. ARMv8 CE) if available on your platform,
>     otherwise an implementation using the slice-by-8 or slice-by-16 method?
>
> The implementation you tested was slice-by-1.  Compared to that, it's common for
> slice-by-8 to speed up CRCs by about 4 times and for folding with carryless
> multiplication to speed up CRCs by 10-30 times, sometimes limited only by memory
> bandwidth.  I don't know what specific results you would get on your specific
> CPU and for this specific CRC, and you could certainly see something different
> if you e.g. have some low-end embedded CPU.  But those are the typical results
> I've seen for other CRCs on different CPUs.  So, a software implementation may
> be more attractive than you realize.  It could very well be the case that a
> PMULL based CRC implementation actually ends up with less CPU load than your
> "hardware offload", when taking into syscall, algif_hash, and driver overhead...
>
> - Eric

Hi Eric, thanks for your detailed and valuable inputs.

As per your suggestion, we did some profiling. 

Use case is to calculate crc32/crc64 for file input from user space.

Instead of directly implementing PMULL based CRC64, we made first comparison between 
Case 1.
CRC32 (splice() + kernel space SW driver) 
https://gist.github.com/ti-kamlesh/5be75dbde292e122135ddf795fad9f21

Case 2.
CRC32(mmap() + userspace armv8 crc32 instruction implementation)
(tried read() as well to get contents of file, but that lost to mmap() so not mentioning number here)
https://gist.github.com/ti-kamlesh/002df094dd522422c6cb62069e15c40d

Case 3.
CRC64 (splice() + MCRC64 HW)
https://gist.github.com/ti-kamlesh/98b1fc36c9a7c3defcc2dced4136b8a0


Overall, overhead of userspace + af_alg + driver in (Case 1) and ( Case 3) is ~0.025s, which is constant for any file size.
This is calculated using real time to calculate crc  - driver time (time spend inside init() + update() +final()) = overhead ~0.025s    

Here, if we consider similar numbers for crc64 PMULL implementation as crc32 (case 2) , we save good number of cpu cycles using mcrc64
in case of files bigger than 5-10mb as most of the time is being spent in HW offload.

+-------------------+-----------------------------+-----------------------+------------------------+------------------------+
|                   |                             |                       |                        |                        |
| File size         | 120mb(ideal size for us)    | 20mb                  | 15mb                   | 5mb                    |
+===================+=============================+=======================+========================+========================+
|                   |                             |                       |                        |                        |
| CRC32 (Case 1)    | Driver time 0.155s          | Driver time 0.0325s   | Driver time 0.019s     | Driver time 0.0062s    |
|                   |    real time 0.18s          |    real time 0.06s    |    real time 0.04s     |    real time 0.03s     |
|                   |    overhead 0.025s          |    overhead 0.025s    |    overhead 0.021s     |    overhead ~0.023s    |
+-------------------+-----------------------------+-----------------------+------------------------+------------------------+
|                   |                             |                       |                        |                        |
| CRC32 (Case 2)    | Real time 0.30s             | Real time 0.05s       | Real time 0.04s        | Real time 0.02s        |
+-------------------+-----------------------------+-----------------------+------------------------+------------------------+
|                   |                             |                       |                        |                        |
| CRC64 (Case 3)    | Driver time   0.385s        | Driver time 0.0665s   | Driver time 0.0515s    | Driver time 0.019s     |
|                   |    real time 0.41s          |    real time 0.09s    |    real time 0.08s     |    real time 0.04s     |
|                   |    overhead 0.025s          |    overhead 0.025s    |    overhead ~0.025s    |    overhead ~0.021s    |
+-------------------+-----------------------------+-----------------------+------------------------+------------------------+