Message-ID: <20260130090222.658adbe4@pumpkin>
Date: Fri, 30 Jan 2026 09:02:22 +0000
From: David Laight <david.laight.linux@...il.com>
To: Matteo Croce <technoboy85@...il.com>
Cc: linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2] KUnit: memcpy: add benchmark
On Fri, 30 Jan 2026 00:45:39 +0100
Matteo Croce <technoboy85@...il.com> wrote:
> Add optional benchmarks for memcpy() and memmove() functions.
> Each benchmark is run twice: first with buffers aligned and then with
> buffers unaligned, to spot unaligned accesses on platforms where they
> have a noticeable performance impact.
...
> +#ifdef CONFIG_MEMCPY_KUNIT_BENCHMARK
> +
> +#define COPY_SIZE (4 * 1024 * 1024)
That is far too big.
You are timing data-cache loads from memory, not memcpy().
To avoid cache misses you probably want to keep the size below 1k.
It is also worth timing short and very short transfers (maybe 2 and 16
bytes) because the fixed overhead can matter more than the transfer
speed.
The difference between 256 and 1024 bytes is enough to (reasonably)
infer the 'cost per byte' for long buffers.
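Something like this (a userspace sketch with made-up numbers, not code from
the patch) is all the arithmetic that inference needs: a two-point linear
fit t(len) = overhead + per_byte * len from the 256- and 1024-byte timings:

```c
#include <stddef.h>

/*
 * Hypothetical helper: given cycle counts for a 256-byte and a
 * 1024-byte copy, separate the fixed call overhead from the
 * asymptotic per-byte cost with a two-point linear fit.
 */
static void fit_copy_cost(double t256, double t1024,
			  double *overhead, double *per_byte)
{
	*per_byte = (t1024 - t256) / (1024.0 - 256.0);
	*overhead = t256 - 256.0 * *per_byte;
}
```

For example, 70 cycles at 256 bytes and 262 cycles at 1024 bytes would give
6 cycles of fixed overhead and 0.25 cycles/byte.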
I think I'd time a simple byte copy loop for comparison purposes.
(You might need a barrier() in the loop to stop gcc changing it.)
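As a userspace sketch of what I mean (assumed names, not the kernel code):
the empty asm plays the role of barrier() and stops gcc from recognising
the loop as memcpy() and substituting a library call, which would defeat
the comparison.

```c
#include <stddef.h>

/*
 * Trivial byte-copy baseline to time against memcpy().
 * The empty asm statement acts as a compiler barrier (like the
 * kernel's barrier()) so gcc cannot replace the loop with a call
 * to memcpy() or otherwise restructure it.
 */
static void byte_copy(char *dst, const char *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		dst[i] = src[i];
		__asm__ volatile("" ::: "memory");	/* barrier() equivalent */
	}
}
```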
Alignment-wise, on some Intel x86 systems the only thing that makes a big
difference to 'rep movsb' is 32-byte aligning the destination buffer.
I don't remember what I got on the zen-5.
'rep movsb' on my zen-5 has a couple of oddities.
- There is a small penalty if the destination starts in the last cache
line of a page.
- If (dest - src) % 4096 is between 1 and 63, everything is very
much slower.
You might want to explicitly include something for the latter.
(I found it getting strange timings for misaligned copies.)
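Setting that case up deliberately is cheap; a hypothetical userspace helper
(not from the patch) could carve both buffers out of one page-aligned
allocation so (dst - src) % 4096 is exactly the value under test, e.g. 32
to land in the slow 1..63 range:

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical setup helper: allocate one page-aligned region and
 * place src and dst inside it so that (dst - src) % 4096 equals
 * delta_mod_4k exactly.  Returns the base pointer for the caller
 * to free(), or NULL on allocation failure.
 */
static char *alloc_offset_pair(size_t len, size_t delta_mod_4k,
			       char **src, char **dst)
{
	/* Room for both buffers plus the offset, rounded to whole pages. */
	size_t sz = (len + delta_mod_4k + 3 * 4096) & ~(size_t)4095;
	char *base = aligned_alloc(4096, sz);

	if (!base)
		return NULL;
	*src = base;
	*dst = base + 4096 + delta_mod_4k;
	return base;
}
```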
Otherwise I got:
length    clocks
0          7
1..3f      5
40         4
41..7f     5
80..1ff   39   (except 16c which is 4 clocks faster!)
200       38
201..23f  40
240       38
241..27f  41
280       39
The pattern then continues much the same, increasing by 1 clock every 64 bytes
with the multiple of 64 being a bit cheaper.
Those timings subtract off a 'test overhead' that may include some of the setup
time for 'rep movsb'.
(I need to do them again using data dependencies instead of lfence.)
David