Message-ID: <20250615184638.GA1480@sol>
Date: Sun, 15 Jun 2025 11:46:38 -0700
From: Eric Biggers <ebiggers@...nel.org>
To: Ard Biesheuvel <ardb@...nel.org>
Cc: Herbert Xu <herbert@...dor.apana.org.au>, linux-crypto@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
linux-mips@...r.kernel.org, linux-riscv@...ts.infradead.org,
linux-s390@...r.kernel.org, sparclinux@...r.kernel.org,
x86@...nel.org, Jason@...c4.com, torvalds@...ux-foundation.org
Subject: Re: [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt
fallback path
On Sun, Jun 15, 2025 at 09:22:51AM +0200, Ard Biesheuvel wrote:
> On Sun, 15 Jun 2025 at 05:18, Eric Biggers <ebiggers@...nel.org> wrote:
> >
> ...
> > After disabling the crypto self-tests, I was then able to run a benchmark of
> > SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
> > recursion bug. I got the following results:
> >
> > ARMv8 crypto extensions: 1864 MB/s
> > Generic C code: 358 MB/s
> > Qualcomm Crypto Engine: 55 MB/s
> >
> > So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
> > Crypto Engine one are useful, and the changes that you're requiring to the
> > CPU-based code are to support these drivers?
> >
>
> And this offload engine only has one internal queue, right? Whereas
> the CPU results may be multiplied by the number of cores on the SoC.
> It would still be interesting to know how much of this is due to
> latency rather than limited throughput, but it seems highly unlikely
> that there are any message sizes large enough that QCE would catch up
> with the CPUs. (AIUI, the only use case we have in the kernel today
> for message sizes substantially larger than this is kTLS, but I'm not
> sure how well it works with crypto_aead compared to offload at a more
> suitable level in the networking stack, and this driver does not
> implement GCM in the first place.)
>
> On ARM SoCs, these offload engines usually exist primarily for the
> benefit of the verified boot implementation in mask ROM, which
> obviously needs to be minimal but doesn't have to be very fast in
> order to get past the first boot stages and hand over to software.
> Then, since the IP block is there, it's listed as a feature in the
> data sheet, even though it is not very useful when running under the
> OS.
With 1 MiB messages, I get 1913 MB/s with ARMv8 CE and 142 MB/s with QCE.
(BTW, that's single-buffer ARMv8 CE. My two-buffer code is over 3000 MB/s.)
I then changed my benchmark code to take full advantage of the async API and
submit as many requests as the hardware can handle. (This would be a best-case
scenario for QCE; in many real use cases this is not possible.) Result with QCE
was 58 MB/s with 4 KiB messages or 155 MB/s for 1 MiB messages.
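For the curious, a rough sketch of the kind of submission loop I mean is below.
This is illustrative only, not the actual benchmark code: NUM_INFLIGHT is an
arbitrary queue depth, allocation error handling is omitted, and it assumes the
current (void *, int) completion callback signature.

#include <crypto/hash.h>
#include <linux/completion.h>
#include <linux/err.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

#define NUM_INFLIGHT 16	/* arbitrary queue depth for illustration */

static void bench_done(void *data, int err)
{
	if (err != -EINPROGRESS)	/* ignore backlog notifications */
		complete(data);
}

static int bench_ahash_async(const char *alg, void *buf,
			     unsigned int msg_len, unsigned int num_msgs)
{
	struct crypto_ahash *tfm = crypto_alloc_ahash(alg, 0, 0);
	struct ahash_request *req[NUM_INFLIGHT];
	struct completion done[NUM_INFLIGHT];
	u8 digest[NUM_INFLIGHT][64];	/* big enough for any common digest */
	struct scatterlist sg;
	unsigned int i, submitted = 0, inflight;
	int err = 0;

	if (IS_ERR(tfm))
		return PTR_ERR(tfm);
	sg_init_one(&sg, buf, msg_len);

	/* Allocation error handling omitted for brevity. */
	for (i = 0; i < NUM_INFLIGHT; i++) {
		req[i] = ahash_request_alloc(tfm, GFP_KERNEL);
		init_completion(&done[i]);
		ahash_request_set_callback(req[i], CRYPTO_TFM_REQ_MAY_BACKLOG,
					   bench_done, &done[i]);
		ahash_request_set_crypt(req[i], &sg, digest[i], msg_len);
	}

	/* Keep up to NUM_INFLIGHT requests outstanding at all times. */
	for (i = 0; i < num_msgs && !err; i++) {
		unsigned int slot = i % NUM_INFLIGHT;

		if (i >= NUM_INFLIGHT) {
			wait_for_completion(&done[slot]);
			reinit_completion(&done[slot]);
		}
		err = crypto_ahash_digest(req[slot]);
		if (err == -EINPROGRESS || err == -EBUSY)
			err = 0;		/* callback will fire later */
		else
			complete(&done[slot]);	/* sync completion or error */
		submitted++;
	}

	/* Drain whatever is still pending. */
	inflight = submitted < NUM_INFLIGHT ? submitted : NUM_INFLIGHT;
	for (i = 0; i < inflight; i++)
		wait_for_completion(&done[i]);

	for (i = 0; i < NUM_INFLIGHT; i++)
		ahash_request_free(req[i]);
	crypto_free_ahash(tfm);
	return err;
}

The point is just that each completion is waited on only when its slot is
needed again, so the engine always has work queued.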
So yes, QCE seems to have only one queue, and even that one queue is *much*
slower than just using the CPU. It's even slower than the generic C code.
And until I fixed it recently, the Crypto API defaulted to using QCE instead of
ARMv8 CE.
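For context on how that default arises: the algorithm lookup simply binds the
name to the registered implementation with the highest cra_priority, so an
offload driver that advertises a larger value than the CPU driver wins.  A
minimal illustration (which driver sets which number is omitted here):

	/*
	 * Illustration only: this resolves "sha256" to whichever registered
	 * "sha256" implementation has the highest cra_priority, regardless
	 * of how slow the underlying hardware actually is.
	 */
	struct crypto_ahash *tfm = crypto_alloc_ahash("sha256", 0, 0);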
But this seems to be a common pattern among the offload engines.
I noticed a similar issue with Intel QAT, which I elaborate on in this patch:
https://lore.kernel.org/r/20250615045145.224567-1-ebiggers@kernel.org
- Eric