Message-ID: <CAFnufp0YAJTm3E14qMzP87gTg_MTT3cH97U7K8jB+NUoQiQcFA@mail.gmail.com>
Date: Fri, 30 Jan 2026 16:01:57 +0100
From: Matteo Croce <technoboy85@...il.com>
To: David Laight <david.laight.linux@...il.com>
Cc: linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2] KUnit: memcpy: add benchmark
On Fri, 30 Jan 2026 at 10:02, David Laight
<david.laight.linux@...il.com> wrote:
>
> On Fri, 30 Jan 2026 00:45:39 +0100
> Matteo Croce <technoboy85@...il.com> wrote:
>
> > Add optional benchmarks for memcpy() and memmove() functions.
> > Each benchmark is run twice: first with buffers aligned and then with
> > buffers unaligned, to spot unaligned accesses on platforms where they
> > have a noticeable performance impact.
> ...
> > +#ifdef CONFIG_MEMCPY_KUNIT_BENCHMARK
> > +
> > +#define COPY_SIZE (4 * 1024 * 1024)
>
> That is far too big.
> You are timing data-cache loads from memory, not memcpy().
>
> To avoid cache misses you probably want to keep the size below 1k.
> It is also worth timing short and very short transfers (maybe 2 and 16
> bytes) because the fixed overhead can matter more than the transfer
> speed.
> The difference between 256 and 1024 bytes is enough to (reasonably)
> infer the 'cost per byte' for long buffers.
>
> I think I'd time a simple byte copy loop for comparison purposes.
> (You might need a barrier() in the loop to stop gcc changing it.)
>
> Alignment-wise, on some Intel x86 systems the only thing that makes a big
> difference to 'rep movsb' is 32-byte aligning the destination buffer.
> I don't remember what I got on the zen-5.
>
> 'rep movsb' on my zen-5 has a couple of oddities.
> - There is a small penalty if the destination starts in the last cache
> line of a page.
> - If (dest - src) % 4096 is between 1 and 63 everything is very
> much slower.
>
> You might want to explicitly include something for the latter.
> (I found it getting strange timings for misaligned copies.)
>
> Otherwise I got:
> length     clocks
> 0               7
> 1..3f           5
> 40              4
> 41..7f          5
> 80..1ff        39  (except 16c which is 4 clocks faster!)
> 200            38
> 201..23f       40
> 240            38
> 241..27f       41
> 280            39
> The pattern then continues much the same, increasing by 1 clock every 64 bytes
> with the multiple of 64 being a bit cheaper.
> Those timings subtract off a 'test overhead' that may include some of the setup
> time for 'rep movsb'.
> (I need to do them again using data dependencies instead of lfence.)
>
> David
I'm currently working on a RISC-V machine which doesn't support
unaligned access.
The RISC-V memcpy falls back to a byte copy when the buffers are
unaligned, so I'm trying to fix that.
That is what I'm using this benchmark for: to measure the improvement
over the current implementation.
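
For reference, this is roughly the shape of the timing loop I have in
mind; a simplified sketch, not the submitted patch, with the names, the
1 KB size and the loop count chosen here just for illustration:

#include <kunit/test.h>
#include <linux/ktime.h>
#include <linux/slab.h>
#include <linux/string.h>

#define BENCH_SIZE	1024	/* the submitted patch uses a 4 MB buffer */
#define BENCH_LOOPS	100	/* repetitions, to get a measurable interval */

static void memcpy_benchmark(struct kunit *test)
{
	u8 *src, *dst;
	ktime_t start;
	s64 us;
	int i, offset;

	/* one spare byte so the offset-by-one run stays inside the buffers */
	src = kunit_kmalloc(test, BENCH_SIZE + 1, GFP_KERNEL);
	dst = kunit_kmalloc(test, BENCH_SIZE + 1, GFP_KERNEL);
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, src);
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, dst);

	/* offset 0 = aligned buffers, offset 1 = deliberately misaligned */
	for (offset = 0; offset <= 1; offset++) {
		start = ktime_get();
		for (i = 0; i < BENCH_LOOPS; i++)
			memcpy(dst + offset, src + offset, BENCH_SIZE);
		us = ktime_to_us(ktime_sub(ktime_get(), start));

		kunit_info(test, "%s copy of %d KBytes in %lld usecs\n",
			   offset ? "unaligned" : "aligned",
			   BENCH_LOOPS * BENCH_SIZE / 1024, us);
	}
}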
These are the numbers with the stock memcpy and a 4 MB buffer:
memcpy: aligned copy of 400 MBytes in 429 msecs (931 MB/s)
memcpy: unaligned copy of 400 MBytes in 1202 msecs (332 MB/s)
These are the numbers with the stock memcpy and a 1 KB buffer:
memcpy: aligned copy of 100 KBytes in 39 usecs (2500 MB/s)
memcpy: unaligned copy of 100 KBytes in 125 usecs (793 MB/s)
These are the numbers with the improved memcpy and a 4 MB buffer:
memcpy: aligned copy of 400 MBytes in 428 msecs (933 MB/s)
memcpy: unaligned copy of 400 MBytes in 519 msecs (770 MB/s)
These are the numbers with the improved memcpy and a 1 KB buffer:
memcpy: aligned copy of 100 KBytes in 44 usecs (2222 MB/s)
memcpy: unaligned copy of 100 KBytes in 55 usecs (1786 MB/s)
If the results depended purely on load times from memory there
wouldn't be such a big difference, yet the improved version is ~2.3x
faster.
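
For reference, one common way to avoid the byte-by-byte fallback when
the two buffers are mutually misaligned is to build every destination
word out of two aligned source loads combined with shifts; a rough
illustration of that general technique only (head and tail bytes and
the already-aligned case are left to the caller, and this assumes
little endian):

/*
 * dst is word-aligned, shift is the byte offset of the original src
 * within its aligned word (1..sizeof(long) - 1), and src_aligned must
 * have nwords + 1 words readable.
 */
static void copy_shifted_words(unsigned long *dst,
			       const unsigned long *src_aligned,
			       unsigned int shift, size_t nwords)
{
	unsigned long prev = src_aligned[0];
	unsigned long next;
	size_t i;

	for (i = 0; i < nwords; i++) {
		next = src_aligned[i + 1];
		/* low bytes from the current word, high bytes from the next */
		dst[i] = (prev >> (shift * 8)) |
			 (next << ((sizeof(unsigned long) - shift) * 8));
		prev = next;
	}
}

That way the inner loop only does aligned word loads and stores, so the
unaligned case can get much closer to the aligned one.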
Also, when timing a big transfer I always get consistent numbers, while
with small ones they fluctuate a lot.
Here is a series of runs with the 4 MB size:
memcpy: aligned copy of 100 KBytes in 39 usecs (2500 MB/s)
memcpy: unaligned copy of 100 KBytes in 125 usecs (793 MB/s)
memcpy: aligned copy of 100 KBytes in 41 usecs (2381 MB/s)
memcpy: unaligned copy of 100 KBytes in 129 usecs (769 MB/s)
memcpy: aligned copy of 100 KBytes in 39 usecs (2500 MB/s)
memcpy: unaligned copy of 100 KBytes in 124 usecs (800 MB/s)
memcpy: aligned copy of 100 KBytes in 39 usecs (2500 MB/s)
memcpy: unaligned copy of 100 KBytes in 128 usecs (775 MB/s)
And these are some 1 KB runs:
memcpy: aligned copy of 100 KBytes in 49 usecs (2040 MB/s)
memcpy: unaligned copy of 100 KBytes in 61 usecs (1639 MB/s)
memcpy: aligned copy of 100 KBytes in 44 usecs (2222 MB/s)
memcpy: unaligned copy of 100 KBytes in 55 usecs (1786 MB/s)
memcpy: aligned copy of 100 KBytes in 41 usecs (2381 MB/s)
memcpy: unaligned copy of 100 KBytes in 53 usecs (1852 MB/s)
memcpy: aligned copy of 100 KBytes in 38 usecs (2564 MB/s)
memcpy: unaligned copy of 100 KBytes in 55 usecs (1786 MB/s)
So, what I could do is extend the test *also* to smaller sizes, like
2 bytes or so.
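
For the byte copy reference loop you suggest, something like this, kept
honest with barrier(), could serve as the baseline (again only a sketch):

#include <linux/compiler.h>
#include <linux/types.h>

/*
 * Reference loop: the barrier() stops gcc from recognising the pattern
 * and turning it back into a memcpy() call, or from optimising the copy
 * away inside the timing loop.
 */
static void byte_copy(u8 *dst, const u8 *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		dst[i] = src[i];
		barrier();
	}
}

Timed over 2, 16, 256 and 1024 bytes alongside memcpy() it should show
how much of the cost is fixed overhead rather than per-byte work.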
Regards,
--
Matteo Croce
perl -e 'for($t=0;;$t++){print chr($t*($t>>8|$t>>13)&255)}' |aplay