Message-ID: <2d5c7775-de20-493d-88cc-011d2261c079@linux.ibm.com>
Date: Fri, 16 Jan 2026 20:20:51 +0100
From: Holger Dengler <dengler@...ux.ibm.com>
To: David Laight <david.laight.linux@...il.com>
Cc: Eric Biggers <ebiggers@...nel.org>, Ard Biesheuvel <ardb@...nel.org>,
        "Jason A . Donenfeld" <Jason@...c4.com>,
        Herbert Xu <herbert@...dor.apana.org.au>,
        Harald Freudenberger <freude@...ux.ibm.com>,
        linux-kernel@...r.kernel.org, linux-crypto@...r.kernel.org
Subject: Re: [PATCH v1 1/1] lib/crypto: tests: Add KUnit tests for AES

On 16/01/2026 19:37, David Laight wrote:
> On Fri, 16 Jan 2026 18:31:58 +0100
> Holger Dengler <dengler@...ux.ibm.com> wrote:
> 
>> Hi David,
>>
>> On 15/01/2026 23:05, David Laight wrote:
>>> On Thu, 15 Jan 2026 12:43:32 -0800
>>> Eric Biggers <ebiggers@...nel.org> wrote:  
>>>>> +static void benchmark_aes(struct kunit *test, const struct aes_testvector *tv)
>>>>> +{
>>>>> +	const size_t num_iters = 10000000;    
>>>>
>>>> 10000000 iterations is too many.  That's 160 MB of data in each
>>>> direction per AES key length.  Some CPUs without AES instructions can do
>>>> only ~20 MB AES per second.  In that case, this benchmark would take 16
>>>> seconds to run per AES key length, for 48 seconds total.  
>>>
>>> Probably best to first do a test that would take a 'reasonable' time
>>> on a cpu without AES. If that is 'very fast' then do a longer test
>>> to get more accuracy on a faster implementation.
>>>   
>>>>
>>>> hash-test-template.h and crc_kunit.c use 10000000 / (len + 128)
>>>> iterations.  That would be 69444 in this case (considering len=16),
>>>> which is less than 1% of the iterations you've used.  Choosing a number
>>>> similar to that would seem more appropriate.
>>>>
>>>> Ultimately these are just made-up numbers.  But I think we should aim
>>>> for the benchmark test in each KUnit test suite to take less than a
>>>> second or so.  The existing tests roughly achieve that, whereas it seems
>>>> this one can go over it by quite a bit due to the 10000000 iterations.  
>>>
>>> Even 1 second is a long time, you end up getting multiple interrupts included.
>>> I think a lot of these benchmarks are far too long.
>>> Timing differences less than 1% can be created by scheduling noise.
>>> Running a test that takes 200 'quanta' of the timer used has an
>>> error margin of under 1% (100 quanta might be enough).
>>> While the kernel timestamps have a resolution of 1ns the accuracy is worse.
>>> If you run a test for even just 10us you ought to get reasonable accuracy
>>> with a reasonable hope of not getting an interrupt.
>>> Run the test 10 times and report the fastest value.
>>>
>>> You'll then find the results are entirely unstable because the cpu clock
>>> frequency keeps changing.
>>> And long enough buffers can get limited by the d-cache loads.
>>>
>>> For something as slow as AES you can count the number of cpu cycles for
>>> a single call and get a reasonably consistent figure.
>>> That will tell you whether the loop is running at the speed you might
>>> expect it to run at.
>>> (You need to use data dependencies between the start/end 'times' and
>>> start/end of the code being timed, x86 lfence/mfence are too slow and
>>> can hide the 'setup' cost of some instructions.)  
>>
>> Thanks a lot for your feedback. I tried a few of your ideas, and it turns out
>> that they work quite well. First of all, with a single-block AES
>> encrypt/decrypt in our hardware (CPACF), we're very close to the resolution of
>> our CPU clock.
>>
>> Disclaimer: The encryption/decryption of one block takes ~32ns (~500MB/s).
>> These numbers should be taken with some care, as on s390 the operating system
>> always runs virtualized. In my test environment, I also only have access to a
>> machine with shared CPUs, so there might be some negative impact from other
>> workload.
> 
> The impact of other workloads is much less likely for a short test,
> and if it does happen you are likely to see a value that is abnormally large.
> 
>> The benchmark loops for 100 iterations now without any warm-up. In each
>> iteration, I measure a single aes_encrypt()/aes_decrypt() call. The lowest
>> value of these measurements is taken as the value for the bandwidth
>> calculations. Although it is not necessary in my environment, I'm doing all
>> iterations with preemption disabled. I think this might help on other
>> platforms to reduce the jitter of the measurement values.
>>
>> The removal of the warm-up does not have any impact on the numbers.
> 
> I'm not sure what the 'warm-up' was for.
> The first test will be slow(er) due to I-cache misses.
> (That will be more noticeable for big software loops - like blake2.)
> Changes to test parameters can affect branch prediction but that also only
> usually affects the first test with each set of parameters.
> (Unlikely to affect AES, but I could see that effect when testing
> mul_u64_u64_div_u64().)
> The only other reason for a 'warm-up' is to get the cpu frequency fast
> and fixed - and there ought to be a better way of doing that.
> 
>>
>> Just for information: I also tried to measure the cycles with the same
>> results. The minimal measurement value of a few iterations is much more stable
>> than the average over a larger number of iterations.
> 
> My userspace test code runs each test 10 times and prints all 10 values.
> I then look at them to see how consistent they are.
> 
>> I also did some tests with IRQs disabled (instead of only preemption), but the
>> numbers stay the same. So I think it is safe enough to stay with disabled
>> preemption.
> 
> I'd actually go for disabling interrupts.
> What you are seeing is the effect of interrupts not happening
> (which is likely for a short test, but not for a long one).

Ok, I'll send the next series with IRQs disabled. I don't see any difference on
my systems.

>> I also tried your idea of doing a few measurements first and, if they are fast
>> enough, increasing the number of iterations. But it turns out that this is not
>> really necessary (at least in my env). I can still add it if it makes sense
>> on other platforms.
> 
> The main reason for doing that is reducing the time the tests take on a
> system that is massively slower (and doing software AES).
> Maybe someone wants to run the test cases on an m68k :-)
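
For illustration only, the two-stage idea (probe first, then scale the
iteration count) could look roughly like the userspace sketch below. It is not
the patch code: pick_iters(), the time budget and the min/max bounds are
hypothetical names and values chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): derive an iteration count
 * from one probe measurement so that the whole run stays within a time
 * budget. A slow software-AES system gets few iterations, a fast one
 * gets more, bounded by [min_iters, max_iters].
 */
static uint64_t pick_iters(uint64_t probe_ns, uint64_t budget_ns,
			   uint64_t min_iters, uint64_t max_iters)
{
	uint64_t n;

	assert(min_iters <= max_iters);

	/* The probe was faster than the clock resolution: use the maximum. */
	if (probe_ns == 0)
		return max_iters;

	n = budget_ns / probe_ns;
	if (n < min_iters)
		n = min_iters;
	if (n > max_iters)
		n = max_iters;
	return n;
}
```

With an assumed 10us budget and bounds of 10..1000, a 32ns block gives 312
iterations. A 16-byte block at ~20 MB/s takes about 800ns, which gives 12.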

So I currently have 100 iterations. The first one or two iterations will serve
as the warm-up (cache misses, branch prediction, etc.). With interrupts
disabled, the remaining iterations should give us enough stable measurements
for the benchmark. It would probably be worth testing the next version of the
test on other platforms as well.
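
As a rough userspace sketch of the "time a single call, keep the minimum"
scheme discussed above (not the KUnit code itself): now_ns(), min_call_ns(),
mb_per_sec() and dummy_block_op() are hypothetical names, the dummy operation
merely stands in for aes_encrypt()/aes_decrypt(), and clock_gettime() replaces
whatever kernel clock the test ends up using. Disabling interrupts has no
userspace equivalent and is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef void (*bench_fn)(void *ctx);

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Time each call separately and return the smallest duration seen.
 * Slow outliers (cold caches on the first calls, interrupts) are
 * dropped because only the minimum is kept.
 */
static uint64_t min_call_ns(bench_fn fn, void *ctx, unsigned int iters)
{
	uint64_t best = UINT64_MAX;
	unsigned int i;

	assert(iters > 0);
	for (i = 0; i < iters; i++) {
		uint64_t t0 = now_ns();

		fn(ctx);

		uint64_t dt = now_ns() - t0;

		if (dt < best)
			best = dt;
	}
	return best;
}

/* Bandwidth in MB/s (10^6 bytes per second); 0 if ns is 0. */
static uint64_t mb_per_sec(uint64_t bytes, uint64_t ns)
{
	return ns ? bytes * 1000 / ns : 0;
}

/* Stand-in for a single-block operation on a 16-byte buffer. */
static void dummy_block_op(void *ctx)
{
	volatile unsigned char *b = ctx;
	int i;

	for (i = 0; i < 16; i++)
		b[i] ^= 0x5a;
}
```

Calling min_call_ns(dummy_block_op, buf, 100) on a 16-byte buffer mirrors the
100-iteration loop, and mb_per_sec(16, 32) reproduces the ~500MB/s figure for
a 32ns block.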

-- 
Mit freundlichen Grüßen / Kind regards
Holger Dengler
--
IBM Systems, Linux on IBM Z Development
dengler@...ux.ibm.com

