Message-ID: <20260130090222.658adbe4@pumpkin>
Date: Fri, 30 Jan 2026 09:02:22 +0000
From: David Laight <david.laight.linux@...il.com>
To: Matteo Croce <technoboy85@...il.com>
Cc: linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2] KUnit: memcpy: add benchmark

On Fri, 30 Jan 2026 00:45:39 +0100
Matteo Croce <technoboy85@...il.com> wrote:

> Add optional benchmarks for memcpy() and memmove() functions.
> Each benchmark is run twice: first with buffers aligned and then with
> buffers unaligned, to spot unaligned accesses on platforms where they
> have a noticeable performance impact.
...
> +#ifdef CONFIG_MEMCPY_KUNIT_BENCHMARK
> +
> +#define COPY_SIZE	(4 * 1024 * 1024)

That is far too big.
You are timing data-cache loads from memory, not memcpy().

To avoid cache misses you probably want to keep the size below 1k.
It is also worth timing short and very short transfers (maybe 2 and 16
bytes) because the fixed overhead can matter more than the transfer
speed.
The difference between 256 and 1024 bytes is enough to (reasonably)
infer the 'cost per byte' for long buffers.

I think I'd time a simple byte copy loop for comparison purposes.
(You might need a barrier() in the loop to stop gcc changing it.)

Alignment wise, on some Intel x86 systems the only thing that makes a big
difference to 'rep movsb' is 32-byte aligning the destination buffer.
I don't remember what I got on the zen-5.

'rep movsb' on my zen-5 has a couple of oddities.
- There is a small penalty if the destination starts in the last cache
  line of a page.
- If (dest - src) % 4096 is between 1 and 63 everything is very
  much slower.

You might want to explicitly include something for the latter.
(I found it while getting strange timings for misaligned copies.)

Otherwise I got:
  length    clocks
       0       7
   1..3f       5
      40       4
  41..7f       5
  80..1ff     39 (except 16c which is 4 clocks faster!)
      200     38
 201..23f     40
      240     38
 241..27f     41
      280     39
The pattern then continues much the same, increasing by 1 clock every 64 bytes
with the multiple of 64 being a bit cheaper.
Those timings subtract off a 'test overhead' that may include some of the setup
time for 'rep movsb'.
(I need to do them again using data dependencies instead of lfence.)

	David
