Message-ID: <20260116183744.04781509@pumpkin>
Date: Fri, 16 Jan 2026 18:37:44 +0000
From: David Laight <david.laight.linux@...il.com>
To: Holger Dengler <dengler@...ux.ibm.com>
Cc: Eric Biggers <ebiggers@...nel.org>, Ard Biesheuvel <ardb@...nel.org>,
 "Jason A . Donenfeld" <Jason@...c4.com>, Herbert Xu
 <herbert@...dor.apana.org.au>, Harald Freudenberger <freude@...ux.ibm.com>,
 linux-kernel@...r.kernel.org, linux-crypto@...r.kernel.org
Subject: Re: [PATCH v1 1/1] lib/crypto: tests: Add KUnit tests for AES

On Fri, 16 Jan 2026 18:31:58 +0100
Holger Dengler <dengler@...ux.ibm.com> wrote:

> Hi David,
> 
> On 15/01/2026 23:05, David Laight wrote:
> > On Thu, 15 Jan 2026 12:43:32 -0800
> > Eric Biggers <ebiggers@...nel.org> wrote:  
> >>> +static void benchmark_aes(struct kunit *test, const struct aes_testvector *tv)
> >>> +{
> >>> +	const size_t num_iters = 10000000;    
> >>
> >> 10000000 iterations is too many.  That's 160 MB of data in each
> >> direction per AES key length.  Some CPUs without AES instructions can do
> >> only ~20 MB AES per second.  In that case, this benchmark would take 16
> >> seconds to run per AES key length, for 48 seconds total.  
> > 
> > Probably best to first do a test that would take a 'reasonable' time
> > on a cpu without AES. If that is 'very fast' then do a longer test
> > to get more accuracy on a faster implementation.
> >   
> >>
> >> hash-test-template.h and crc_kunit.c use 10000000 / (len + 128)
> >> iterations.  That would be 69444 in this case (considering len=16),
> >> which is less than 1% of the iterations you've used.  Choosing a number
> >> similar to that would seem more appropriate.
> >>
> >> Ultimately these are just made-up numbers.  But I think we should aim
> >> for the benchmark test in each KUnit test suite to take less than a
> >> second or so.  The existing tests roughly achieve that, whereas it seems
> >> this one can go over it by quite a bit due to the 10000000 iterations.  
> > 
> > Even 1 second is a long time; you end up getting multiple interrupts included.
> > I think a lot of these benchmarks are far too long.
> > Timing differences of less than 1% can be created by scheduling noise.
> > Running a test that takes 200 'quanta' of the timer used has an
> > error margin of under 1% (100 quanta might be enough).
> > While the kernel timestamps have a resolution of 1ns, the accuracy is worse.
> > If you run a test for even just 10us you ought to get reasonable accuracy
> > with a reasonable hope of not getting an interrupt.
> > Run the test 10 times and report the fastest value.
> > 
> > You'll then find the results are entirely unstable because the cpu clock
> > frequency keeps changing.
> > And long enough buffers can get limited by the d-cache loads.
> > 
> > For something as slow as AES you can count the number of cpu cycles for
> > a single call and get a reasonably consistent figure.
> > That will tell you whether the loop is running at the speed you might
> > expect it to run at.
> > (You need to use data dependencies between the start/end 'times' and the
> > start/end of the code being timed; x86 lfence/mfence are too slow and
> > can hide the 'setup' cost of some instructions.)
> 
> Thanks a lot for your feedback. I tried a few of your ideas and it turns out
> that they work quite well. First of all, with a single-block aes
> encrypt/decrypt in our hardware (CPACF), we're very close to the resolution of
> our CPU clock.
> 
> Disclaimer: The encryption/decryption of one block takes ~32ns (~500MB/s).
> These numbers should be taken with some care, as on s390 the operating system
> always runs virtualized. In my test environment, I also only have access to a
> machine with shared CPUs, so there might be some negative impact from other
> workloads.

The impact of other workloads is much less likely for a short test,
and if it does happen you are likely to see a value that is abnormally large.

> The benchmark now loops for 100 iterations without any warm-up. In each
> iteration, I measure a single aes_encrypt()/aes_decrypt() call. The lowest
> value of these measurements is taken as the value for the bandwidth
> calculations. Although it is not necessary in my environment, I'm doing all
> iterations with preemption disabled. I think that this might help on other
> platforms to reduce the jitter of the measurement values.
> 
> The removal of the warm-up does not have any impact on the numbers.

I'm not sure what the 'warm-up' was for.
The first test will be slow(er) due to I-cache misses.
(That will be more noticeable for big software loops - like blake2.)
Changes to the test parameters can affect branch prediction, but that also
usually only affects the first test with each set of parameters.
(Unlikely to affect AES, but I could see that effect when testing
mul_u64_u64_div_u64().)
The only other reason for a 'warm-up' is to get the cpu frequency fast
and fixed - and there ought to be a better way of doing that.
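
For reference, a minimal sketch of the measurement loop described above
(assuming the lib/crypto aes_encrypt() signature; the ctx/in/out setup is
omitted and the function name is illustrative):

	static u64 bench_single_call(const struct crypto_aes_ctx *ctx,
				     u8 *out, const u8 *in)
	{
		u64 best = U64_MAX;
		int i;

		preempt_disable();
		for (i = 0; i < 100; i++) {
			u64 t0 = ktime_get_ns();

			aes_encrypt(ctx, out, in);	/* one 16-byte block */

			/* Keep the fastest observation; interference only
			 * ever makes a measurement larger. */
			best = min(best, ktime_get_ns() - t0);
		}
		preempt_enable();

		return best;	/* fastest single-block time, in ns */
	}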

> 
> Just for information: I also tried to measure the cycles, with the same
> results. The minimal measurement value of a few iterations is much more stable
> than the average over a larger number of iterations.

My userspace test code runs each test 10 times and prints all 10 values.
I then look at them to see how consistent they are.
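In userspace that can be as simple as the following sketch, where
do_one_test() stands in for a hypothetical test body:

	#include <stdio.h>
	#include <time.h>

	extern void do_one_test(void);	/* hypothetical test body */

	static long long now_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec * 1000000000LL + ts.tv_nsec;
	}

	int main(void)
	{
		/* Print every run so outliers stay visible instead of
		 * being averaged away. */
		for (int run = 0; run < 10; run++) {
			long long t0 = now_ns();

			do_one_test();
			printf("run %d: %lld ns\n", run, now_ns() - t0);
		}
		return 0;
	}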

> I also did some tests with IRQs disabled (instead of only preemption), but the
> numbers stay the same. So I think it is safe enough to stay with disabled
> preemption.

I'd actually go for disabling interrupts.
What you are seeing is the effect of interrupts not happening
(which is likely for a short test, but not for a long one).
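Masking interrupts around the timed region is only a small change over the
preemption-disabled sketch above (same assumptions):

	unsigned long flags;
	u64 t0, dt;

	/* Mask interrupts on the local CPU for the timed region only. */
	local_irq_save(flags);
	t0 = ktime_get_ns();
	aes_encrypt(ctx, out, in);
	dt = ktime_get_ns() - t0;
	local_irq_restore(flags);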

> 
> I also tried your idea of first doing a few measurements and, if they are fast
> enough, increasing the number of iterations. But it turns out that this is not
> really necessary (at least in my env). But I can add this if it makes sense
> on other platforms.

The main reason for doing that is reducing the time the tests take on a
system that is massively slower (and doing software AES).
Maybe someone wants to run the test cases on an m68k :-)
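
The calibration could look something like this (a sketch; the 10ms target
and the clamp bounds are made-up numbers):

	u64 t0, per_call_ns, num_iters;
	int i;

	/* Short calibration run: time 1000 single-block calls. */
	t0 = ktime_get_ns();
	for (i = 0; i < 1000; i++)
		aes_encrypt(ctx, out, in);
	per_call_ns = (ktime_get_ns() - t0) / 1000;
	if (!per_call_ns)
		per_call_ns = 1;

	/* Aim for ~10ms total; clamp to a sane range either way. */
	num_iters = clamp_t(u64, 10 * NSEC_PER_MSEC / per_call_ns,
			    100, 1000000);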

	David


