Message-ID: <913e23f9-d039-4de1-a0d3-d1067dcda8ac@hogyros.de>
Date: Tue, 5 Aug 2025 13:49:31 +0900
From: Simon Richter <Simon.Richter@...yros.de>
To: Eric Biggers <ebiggers@...nel.org>,
 Christophe Leroy <christophe.leroy@...roup.eu>
Cc: linux-crypto@...r.kernel.org, linux-kernel@...r.kernel.org,
 Ard Biesheuvel <ardb@...nel.org>, "Jason A . Donenfeld" <Jason@...c4.com>,
 linux-mips@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
 sparclinux@...r.kernel.org
Subject: Crypto use cases (was: Remove PowerPC optimized MD5 code)

Hi,

On 8/5/25 07:59, Eric Biggers wrote:

>> md5sum uses the kernel's MD5 code:

> What?  That's crazy.  Userspace MD5 code would be faster and more
> reliable.  No need to make syscalls, transfer data to and from the
> kernel, have an external dependency, etc.  Is this the coreutils md5sum?
> We need to get this reported and fixed.

The userspace crypto API (AF_ALG) allows zero-copy transfers from 
userspace, and AFAIK it can also operate directly on files via 
sendfile()/splice(), without ever copying the data into userspace 
(so we save one copy).

Userspace requests are also where the asynchronous hardware offload 
units get to chomp on large blocks of data while the CPU is doing 
something else:

$ time dd if=test.bin of=/dev/zero bs=1G     # warm up caches
real    0m1.541s
user    0m0.000s
sys     0m0.732s

$ time gzip -9 <test.bin >test.bin.gz        # compress with the CPU
real    2m57.789s
user    2m55.986s
sys     0m1.508s

$ time ./gzfht_test test.bin                 # compress with NEST unit
real    0m3.207s
user    0m0.584s
sys     0m2.487s

$ time gzip -d <test.bin.nx.gz >test.bin.nx  # decompress with CPU
real    1m0.103s
user    0m57.990s
sys     0m1.878s

$ time ./gunz_test test.bin.gz               # decompress with NEST unit
real    0m2.722s
user    0m0.200s
sys     0m1.872s

That's why I'm objecting to measuring the general usefulness of hardware 
crypto units by the standards of fscrypt, which has an artificial 
limitation of never submitting blocks larger than 4kB: there are other 
use cases that don't have that limitation, and where the overhead is 
negligible because it is incurred only once for a few gigabytes of data.

That's why I suggested changing from a priority field to "speed" and 
"overhead" fields, and calculating the priority for each application 
as (size/speed + overhead) -- smallest number wins, where size is the 
typical request size the application expects to use (which for 
fscrypt and IPsec is on the small side, so they would always select 
the CPU unless a low-overhead offload engine was available).

This probably needs some adjustment to allow selecting a low-power 
implementation (e.g. on mobile, I'd want to use offloading for 
fscrypt even if it is slower), and to model request batching, which 
reduces the overhead on a busy system, but it should be a good start.

    Simon
