Message-ID: <20250625191701.GC1703@sol>
Date: Wed, 25 Jun 2025 12:17:01 -0700
From: Eric Biggers <ebiggers@...nel.org>
To: Maxime MERE <maxime.mere@...s.st.com>
Cc: Simon Richter <Simon.Richter@...yros.de>, linux-fscrypt@...r.kernel.org,
	linux-crypto@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-mtd@...ts.infradead.org, linux-ext4@...r.kernel.org,
	linux-f2fs-devel@...ts.sourceforge.net, ceph-devel@...r.kernel.org
Subject: Re: [PATCH] fscrypt: don't use hardware offload Crypto API drivers

On Wed, Jun 25, 2025 at 06:29:26PM +0200, Maxime MERE wrote:
> Hi,
> 
> On 6/25/25 08:32, Eric Biggers wrote:
> > That was the synchronous throughput.  However, submitting multiple requests
> > asynchronously (which again, fscrypt doesn't actually do) barely helps.
> > Apparently the STM32 crypto engine has only one hardware queue.
> > 
> > I already strongly suspected that these non-inline crypto engines aren't worth
> > using.  But I didn't realize they are quite this bad.  Even with AES on a
> > Cortex-A7 CPU that lacks AES instructions, the CPU is much faster!
> 
> From a performance perspective, using hardware crypto offloads the CPU,
> which is important in real-world applications where the CPU must handle
> multiple tasks. Our processors are often single-core and not the highest
> performing, so hardware acceleration is valuable.
> 
> I can show you a performance test run with OpenSSL (3.2.4) that shows
> lower CPU usage and better performance for large blocks of data when our
> driver is used (via afalg):
> 
> command used: ```openssl speed -evp aes-256-cbc -engine afalg -elapsed```
> 
> +--------------------+--------------+-----------------+
> | Block Size (bytes) | AFALG (MB/s) | SW BASED (MB/s) |
> +--------------------+--------------+-----------------+
> | 16                 | 0.09         | 9.44            |
> | 64                 | 0.34         | 11.43           |
> | 256                | 1.31         | 12.08           |
> | 1024               | 4.96         | 12.27           |
> | 8192               | 18.18        | 12.33           |
> | 16384              | 22.48        | 12.33           |
> +--------------------+--------------+-----------------+
> 
> To test CPU usage I used a single-core stm32mp157f.
> With afalg, average CPU usage is ~75%; with the software-based
> approach the CPU is at ~100%.
> 
> Maxime

fscrypt is almost always used with 4096-byte blocks, which in my benchmark took
about 1300 μs each with AES-128-CBC-ESSIV w/ STM32 engine, 264 μs each with
AES-128-CBC-ESSIV w/ CPU, or 77 μs each with Adiantum w/ CPU.  The CPU-based
times seem short enough that there isn't much time for another task to be
usefully scheduled while waiting for each block.  It's important to consider (a)
driver overhead, (b) scheduling overhead, and (c) the low instructions per
second of this processor in the first place.
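
To make the per-block times above easier to compare with throughput figures,
here is a minimal sketch of the conversion.  It assumes 1 MB = 10^6 bytes and
uses only the three timings quoted above; the function and variable names are
illustrative and not taken from any benchmark tool.

```python
# Convert per-block latency into throughput for 4096-byte blocks.
# Assumption: 1 MB = 10**6 bytes, so bytes per microsecond equals MB/s.
BLOCK_SIZE = 4096  # bytes, the block size discussed above

# Per-block times from the benchmark quoted above, in microseconds.
LATENCY_US = {
    "AES-128-CBC-ESSIV w/ STM32 engine": 1300,
    "AES-128-CBC-ESSIV w/ CPU": 264,
    "Adiantum w/ CPU": 77,
}


def throughput_mb_per_s(block_bytes, micros):
    """Return throughput in MB/s for one block processed in `micros` us."""
    return block_bytes / micros


def speedup(baseline_us, other_us):
    """Return how many times faster `other_us` is than `baseline_us`."""
    return baseline_us / other_us


if __name__ == "__main__":
    baseline = LATENCY_US["AES-128-CBC-ESSIV w/ STM32 engine"]
    for name, us in LATENCY_US.items():
        rate = throughput_mb_per_s(BLOCK_SIZE, us)
        print(f"{name}: {rate:.1f} MB/s, {speedup(baseline, us):.1f}x vs engine")
```

With these inputs the three cases come out to roughly 3.2, 15.5, and 53.2 MB/s,
i.e. the CPU is about 4.9x (AES-128-CBC-ESSIV) and 16.9x (Adiantum) faster than
the engine per 4096-byte block.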

By the way, the board I have (STM32MP157F-DK2) is actually multi-core.  It seems
this is common among ST's offerings that are intended to run Linux?  (Of course,
the microcontrollers that don't run Linux are another story.)

- Eric
