Message-ID: <20260130090222.658adbe4@pumpkin>
Date: Fri, 30 Jan 2026 09:02:22 +0000
From: David Laight <david.laight.linux@...il.com>
To: Matteo Croce <technoboy85@...il.com>
Cc: linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2] KUnit: memcpy: add benchmark
On Fri, 30 Jan 2026 00:45:39 +0100
Matteo Croce <technoboy85@...il.com> wrote:
> Add optional benchmarks for memcpy() and memmove() functions.
> Each benchmark is run twice: first with buffers aligned and then with
> buffers unaligned, to spot unaligned accesses on platforms where they
> have a noticeable performance impact.
...
> +#ifdef CONFIG_MEMCPY_KUNIT_BENCHMARK
> +
> +#define COPY_SIZE (4 * 1024 * 1024)
That is far too big.
You are timing data-cache loads from memory, not memcpy().
To avoid cache misses you probably want to keep the size below 1k.
It is also worth timing short and very short transfers (maybe 2 and 16
bytes) because the fixed overhead can matter more than the transfer
speed.
The difference between 256 and 1024 bytes is enough to (reasonably)
infer the 'cost per byte' for long buffers.
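Something like this (a userspace sketch with made-up numbers, not code from
the patch) is all the arithmetic that inference needs: a two-point linear
fit t(len) = overhead + per_byte * len from the 256- and 1024-byte timings:

```c
#include <stddef.h>

/*
 * Hypothetical helper: given cycle counts for a 256-byte and a
 * 1024-byte copy, separate the fixed call overhead from the
 * asymptotic per-byte cost with a two-point linear fit.
 */
static void fit_copy_cost(double t256, double t1024,
			  double *overhead, double *per_byte)
{
	*per_byte = (t1024 - t256) / (1024.0 - 256.0);
	*overhead = t256 - 256.0 * *per_byte;
}
```

For example, 70 cycles at 256 bytes and 262 cycles at 1024 bytes would give
6 cycles of fixed overhead and 0.25 cycles/byte.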
I think I'd time a simple byte copy loop for comparison purposes.
(You might need a barrier() in the loop to stop gcc changing it.)
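As a userspace sketch of what I mean (assumed names, not the kernel code):
the empty asm plays the role of barrier() and stops gcc from recognising
the loop as memcpy() and substituting a library call, which would defeat
the comparison.

```c
#include <stddef.h>

/*
 * Trivial byte-copy baseline to time against memcpy().
 * The empty asm statement acts as a compiler barrier (like the
 * kernel's barrier()) so gcc cannot replace the loop with a call
 * to memcpy() or otherwise restructure it.
 */
static void byte_copy(char *dst, const char *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		dst[i] = src[i];
		__asm__ volatile("" ::: "memory");	/* barrier() equivalent */
	}
}
```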
Alignment-wise, on some Intel x86 systems the only thing that makes a big
difference to 'rep movsb' is 32-byte aligning the destination buffer.
I don't remember what I got on the zen-5.
'rep movsb' on my zen-5 has a couple of oddities.
- There is a small penalty if the destination starts in the last cache
line of a page.
- If (dest - src) % 4096 is between 1 and 63, everything is very
much slower.
You might want to explicitly include something for the latter.
(I found it getting strange timings for misaligned copies.)
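Setting that case up deliberately is cheap; a hypothetical userspace helper
(not from the patch) could carve both buffers out of one page-aligned
allocation so (dst - src) % 4096 is exactly the value under test, e.g. 32
to land in the slow 1..63 range:

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical setup helper: allocate one page-aligned region and
 * place src and dst inside it so that (dst - src) % 4096 equals
 * delta_mod_4k exactly.  Returns the base pointer for the caller
 * to free(), or NULL on allocation failure.
 */
static char *alloc_offset_pair(size_t len, size_t delta_mod_4k,
			       char **src, char **dst)
{
	/* Room for both buffers plus the offset, rounded to whole pages. */
	size_t sz = (len + delta_mod_4k + 3 * 4096) & ~(size_t)4095;
	char *base = aligned_alloc(4096, sz);

	if (!base)
		return NULL;
	*src = base;
	*dst = base + 4096 + delta_mod_4k;
	return base;
}
```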
Otherwise I got:
length    clocks
0          7
1..3f      5
40         4
41..7f     5
80..1ff   39   (except 16c which is 4 clocks faster!)
200       38
201..23f  40
240       38
241..27f  41
280       39
The pattern then continues much the same, increasing by 1 clock every 64 bytes
with the multiple of 64 being a bit cheaper.
Those timings subtract off a 'test overhead' that may include some of the setup
time for 'rep movsb'.
(I need to do them again using data dependencies instead of lfence.)
David