Message-ID: <389595e9-e13a-42e3-b0ff-9ca0dd3effe3@linux.ibm.com>
Date: Fri, 16 Jan 2026 18:31:58 +0100
From: Holger Dengler <dengler@...ux.ibm.com>
To: David Laight <david.laight.linux@...il.com>,
        Eric Biggers <ebiggers@...nel.org>
Cc: Ard Biesheuvel <ardb@...nel.org>, "Jason A . Donenfeld"
 <Jason@...c4.com>,
        Herbert Xu <herbert@...dor.apana.org.au>,
        Harald Freudenberger <freude@...ux.ibm.com>,
        linux-kernel@...r.kernel.org, linux-crypto@...r.kernel.org
Subject: Re: [PATCH v1 1/1] lib/crypto: tests: Add KUnit tests for AES

Hi David,

On 15/01/2026 23:05, David Laight wrote:
> On Thu, 15 Jan 2026 12:43:32 -0800
> Eric Biggers <ebiggers@...nel.org> wrote:
>>> +static void benchmark_aes(struct kunit *test, const struct aes_testvector *tv)
>>> +{
>>> +	const size_t num_iters = 10000000;  
>>
>> 10000000 iterations is too many.  That's 160 MB of data in each
>> direction per AES key length.  Some CPUs without AES instructions can do
>> only ~20 MB AES per second.  In that case, this benchmark would take 16
>> seconds to run per AES key length, for 48 seconds total.
> 
> Probably best to first do a test that would take a 'reasonable' time
> on a cpu without AES. If that is 'very fast' then do a longer test
> to get more accuracy on a faster implementation.
> 
>>
>> hash-test-template.h and crc_kunit.c use 10000000 / (len + 128)
>> iterations.  That would be 69444 in this case (considering len=16),
>> which is less than 1% of the iterations you've used.  Choosing a number
>> similar to that would seem more appropriate.
>>
>> Ultimately these are just made-up numbers.  But I think we should aim
>> for the benchmark test in each KUnit test suite to take less than a
>> second or so.  The existing tests roughly achieve that, whereas it seems
>> this one can go over it by quite a bit due to the 10000000 iterations.
> 
> Even 1 second is a long time, you end up getting multiple interrupts included.
> I think a lot of these benchmarks are far too long.
> Timing differences less than 1% can be created by scheduling noise.
> Running a test that takes 200 'quanta' of the timer used has an
> error margin of under 1% (100 quanta might be enough).
> While the kernel timestamps have a resolution of 1ns the accuracy is worse.
> If you run a test for even just 10us you ought to get reasonable accuracy
> with a reasonable hope of not getting an interrupt.
> Run the test 10 times and report the fastest value.
> 
> You'll then find the results are entirely unstable because the cpu clock
> frequency keeps changing.
> And long enough buffers can get limited by the d-cache loads.
> 
> For something as slow as AES you can count the number of cpu cycles for
> a single call and get a reasonably consistent figure.
> That will tell you whether the loop is running at the speed you might
> expect it to run at.
> (You need to use data dependencies between the start/end 'times' and
> start/end of the code being timed, x86 lfence/mfence are too slow and
> can hide the 'setup' cost of some instructions.)

Thanks a lot for your feedback. I tried a few of your ideas and it turns out
that they work quite well. First of all, with a single-block AES
encrypt/decrypt in our hardware (CPACF), we're very close to the resolution of
our CPU clock.

Disclaimer: The encryption/decryption of one block takes ~32ns (~500MB/s).
These numbers should be taken with some care, as on s390 the operating system
always runs virtualized. In my test environment, I also only have access to a
machine with shared CPUs, so there might be some negative impact from other
workloads.
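
For reference, the conversion from the per-block time to the bandwidth figure
can be sketched as below. This is only an illustration, not code from the
patch: the helper name throughput_mb_per_s() is hypothetical, and it assumes
1 MB = 10^6 bytes and a 16-byte AES block.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): throughput in MB/s,
 * assuming 1 MB = 10^6 bytes.
 */
static uint64_t throughput_mb_per_s(uint64_t bytes, uint64_t elapsed_ns)
{
	if (elapsed_ns == 0)
		return 0;	/* measurement was below the timer resolution */

	/* bytes/ns * 10^9 ns/s / 10^6 bytes/MB == bytes * 1000 / ns */
	return bytes * 1000 / elapsed_ns;
}
```

With one 16-byte block in 32 ns, throughput_mb_per_s(16, 32) evaluates to 500.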

The benchmark now loops for 100 iterations without any warm-up. In each
iteration, I measure a single aes_encrypt()/aes_decrypt() call. The lowest
value of these measurements is taken as the value for the bandwidth
calculations. Although it is not necessary in my environment, I'm doing all
iterations with preemption disabled. I think this might help on other
platforms to reduce the jitter of the measurement values.
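
A minimal userspace sketch of this "take the minimum of N single-call
measurements" approach is shown below. It is not the KUnit code: it assumes
clock_gettime(CLOCK_MONOTONIC) as the time source, it cannot disable
preemption, and the names now_ns(), min_time_ns() and dummy_op() are
hypothetical. dummy_op() only stands in for the aes_encrypt()/aes_decrypt()
call.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback type for the operation being timed. */
typedef void (*bench_fn)(void *ctx);

/* Stand-in workload: bumps a counter so that calls can be verified. */
static void dummy_op(void *ctx)
{
	volatile uint64_t *calls = ctx;

	(*calls)++;
}

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Time each call individually and return the smallest duration seen.
 * Returns UINT64_MAX if num_iters is 0.
 */
static uint64_t min_time_ns(bench_fn fn, void *ctx, size_t num_iters)
{
	uint64_t best = UINT64_MAX;

	for (size_t i = 0; i < num_iters; i++) {
		uint64_t t0 = now_ns();

		fn(ctx);

		uint64_t elapsed = now_ns() - t0;

		if (elapsed < best)
			best = elapsed;
	}
	return best;
}
```

Taking the minimum rather than the average discards iterations that were
disturbed by an interrupt or by other workloads, which is the effect described
above.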

The removal of the warm-up does not have any impact on the numbers.

Just for information: I also tried to measure the cycles, with the same
results. The minimal measurement value of a few iterations is much more stable
than the average over a larger number of iterations.

I also did some tests with IRQs disabled (instead of only preemption), but the
numbers stay the same. So I think it is safe enough to stay with disabled
preemption.

I also tried your idea of first doing a few measurements and, if they are fast
enough, increasing the number of iterations. But it turns out that this is not
really necessary (at least in my env). I can add this if it makes sense
on other platforms.
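
If such an adaptive step were added, one possible shape is sketched below.
This is only an assumption about how it could look, not something I have in
the patch: choose_num_iters() and all of its parameters are hypothetical. It
derives the iteration count from a short probe measurement and a time budget,
clamped to a fixed range.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pick an iteration count so that the whole run
 * fits into budget_ns, based on the per-call time seen in a short probe.
 * The result is clamped to [min_iters, max_iters].
 */
static uint64_t choose_num_iters(uint64_t probe_ns, uint64_t budget_ns,
				 uint64_t min_iters, uint64_t max_iters)
{
	uint64_t iters;

	if (probe_ns == 0)
		return max_iters;	/* faster than the timer can resolve */

	iters = budget_ns / probe_ns;
	if (iters < min_iters)
		iters = min_iters;
	if (iters > max_iters)
		iters = max_iters;
	return iters;
}
```

A slow implementation then stays at min_iters, while a fast one gets more
iterations up to the cap.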

-- 
Mit freundlichen Grüßen / Kind regards
Holger Dengler
--
IBM Systems, Linux on IBM Z Development
dengler@...ux.ibm.com

