Message-ID: <20250326224527.10105902@pumpkin>
Date: Wed, 26 Mar 2025 22:45:27 +0000
From: David Laight <david.laight.linux@...il.com>
To: Herton Krzesinski <hkrzesin@...hat.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, Mateusz Guzik
 <mjguzik@...il.com>, x86@...nel.org, tglx@...utronix.de, mingo@...hat.com,
 bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
 olichtne@...hat.com, atomasov@...hat.com, aokuliar@...hat.com,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an
 overlapping store

On Tue, 25 Mar 2025 19:42:09 -0300
Herton Krzesinski <hkrzesin@...hat.com> wrote:

...
> I have been trying to also measure the impact of changes like the above;
> however, it seems I don't get an improvement, or it's limited due to the
> impact of profiling. I tried to uninline/move copy_user_generic() like this:

If you use the PERF_COUNT_HW_CPU_CYCLES counter bracketed by 'mfence'
you can get reasonably consistent cycle counts for short sequences.
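
Something like this works (a sketch; the perf_event_open() setup here is
my assumption, not anything from the thread):

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <string.h>
	#include <stdint.h>

	static int open_cycles(void)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_CPU_CYCLES;
		attr.exclude_kernel = 1;
		attr.exclude_hv = 1;
		/* perf_event_open() has no glibc wrapper */
		return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	}

	static inline uint64_t read_cycles(int fd)
	{
		uint64_t count;

		/* the fences stop the code under test moving across the read */
		asm volatile("mfence" ::: "memory");
		read(fd, &count, sizeof(count));
		asm volatile("mfence" ::: "memory");
		return count;
	}

Time the sequence between two read_cycles() calls and subtract an empty
measurement (the test overhead mentioned below).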

The problem here is that you need to measure on the specific cpu that is
causing the issues - probably zen2 or zen3.

Benchmarking 'rep movsb' on a zen5 can be summarised as follows.
Test overhead: 195 clocks (the same asm with 'rep movsb' replaced by a 'nop'),
subtracted from the other values.
  length    clocks
       0       7
   1..3f       5
      40       4
  41..7f       5
  80..1ff     39 (except 16c which is 4 clocks faster!)
      200     38
 201..23f     40
      240     38
 241..27f     41
      280     39
The pattern then continues much the same, increasing by 1 clock every 64 bytes,
with exact multiples of 64 being a bit cheaper.
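
(For reference, the copy under test is just the plain instruction wrapped
in asm - something like this sketch:

	static inline void rep_movsb(void *dst, const void *src, size_t len)
	{
		/* rdi, rsi and rcx are all updated by the instruction */
		asm volatile("rep movsb"
			     : "+D" (dst), "+S" (src), "+c" (len)
			     :
			     : "memory");
	}

with the 'nop' variant for the overhead measurement being the same wrapper
with the instruction replaced.)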

With a following wind a copy loop should do 8 bytes/clock.
(Faster if the cpu supports more than one write/clock.)
So a copy loop might be faster than 'rep movsb' for lengths between 128 and
~256 bytes.
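
By 'copy loop' I mean the usual word-at-a-time loop - a sketch, not the
kernel's actual code:

	#include <string.h>
	#include <stdint.h>

	static void copy_words(void *dst, const void *src, size_t len)
	{
		uint64_t *d = dst;
		const uint64_t *s = src;

		/* one 8-byte load and store per iteration: ~8 bytes/clock
		 * once the loop overhead is hidden */
		for (; len >= 8; len -= 8)
			*d++ = *s++;

		/* tail (memcpy here just for brevity) */
		memcpy(d, s, len);
	}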

Misaligning the addresses doesn't usually make any difference.
(There is a small penalty for destinations in the last cache line of a page.)

But there is a strange oddity.
If (dest - src) % 4096 is between 1 and 63, then short copies take 55 clocks,
jumping to 75 at 128 bytes and then increasing slowly.
(I think that matches what I've seen.)
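
That aliasing case is easy to reproduce by offsetting the two buffers within
a page - a hypothetical setup:

	#include <stdlib.h>

	unsigned delta = 16;		/* anything in 1..63 */
	unsigned char *base = aligned_alloc(4096, 16 * 4096);
	unsigned char *src = base;
	/* (dst - src) % 4096 == delta, the slow case above */
	unsigned char *dst = base + 4096 + delta;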

	David


