Message-ID: <20250326224527.10105902@pumpkin>
Date: Wed, 26 Mar 2025 22:45:27 +0000
From: David Laight <david.laight.linux@...il.com>
To: Herton Krzesinski <hkrzesin@...hat.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, Mateusz Guzik
 <mjguzik@...il.com>, x86@...nel.org, tglx@...utronix.de, mingo@...hat.com,
 bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
 olichtne@...hat.com, atomasov@...hat.com, aokuliar@...hat.com,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an
 overlapping store

On Tue, 25 Mar 2025 19:42:09 -0300
Herton Krzesinski <hkrzesin@...hat.com> wrote:

...
> I have been trying to also measure the impact of changes like the above;
> however, it seems I don't get an improvement, or it's limited due to the
> impact of profiling. I tried to uninline/move copy_user_generic() like this:

If you use the PERF_COUNT_HW_CPU_CYCLES counter bracketed by 'mfence'
you can get reasonably consistent cycle counts for short sequences.
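
Something like this works (a sketch; the perf_event_open() setup here is
my assumption, not anything from the thread):

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <string.h>
	#include <stdint.h>

	static int open_cycles(void)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_CPU_CYCLES;
		attr.exclude_kernel = 1;
		attr.exclude_hv = 1;
		/* perf_event_open() has no glibc wrapper */
		return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	}

	static inline uint64_t read_cycles(int fd)
	{
		uint64_t count;

		/* the fences stop the code under test moving across the read */
		asm volatile("mfence" ::: "memory");
		read(fd, &count, sizeof(count));
		asm volatile("mfence" ::: "memory");
		return count;
	}

Time the sequence between two read_cycles() calls and subtract an empty
measurement (the test overhead mentioned below).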

The problem here is that you need to measure on the specific cpu that is
causing the issues - probably zen2 or zen3.

Benchmarking 'rep movsb' on a zen5 can be summarised as follows.
Test overhead: 195 clocks (the same asm with 'rep movsb' replaced by a 'nop'),
subtracted from the other values.
  length    clocks
       0       7
   1..3f       5
      40       4
  41..7f       5
  80..1ff     39 (except 16c which is 4 clocks faster!)
      200     38
 201..23f     40
      240     38
 241..27f     41
      280     39
The pattern then continues much the same, increasing by 1 clock every 64 bytes,
with exact multiples of 64 being a bit cheaper.
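
(For reference, the copy under test is just the plain instruction wrapped
in asm - something like this sketch:

	static inline void rep_movsb(void *dst, const void *src, size_t len)
	{
		/* rdi, rsi and rcx are all updated by the instruction */
		asm volatile("rep movsb"
			     : "+D" (dst), "+S" (src), "+c" (len)
			     :
			     : "memory");
	}

with the 'nop' variant for the overhead measurement being the same wrapper
with the instruction replaced.)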

With a following wind a copy loop should do 8 bytes/clock.
(Faster if the cpu supports more than one write/clock.)
So a copy loop might be faster than 'rep movsb' for lengths between 128 and
~256 bytes.
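
By 'copy loop' I mean the usual word-at-a-time loop - a sketch, not the
kernel's actual code:

	#include <string.h>
	#include <stdint.h>

	static void copy_words(void *dst, const void *src, size_t len)
	{
		uint64_t *d = dst;
		const uint64_t *s = src;

		/* one 8-byte load and store per iteration: ~8 bytes/clock
		 * once the loop overhead is hidden */
		for (; len >= 8; len -= 8)
			*d++ = *s++;

		/* tail (memcpy here just for brevity) */
		memcpy(d, s, len);
	}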

Misaligning the addresses doesn't usually make any difference.
(There is a small penalty for destinations in the last cache line of a page.)

But there is a strange oddity.
If (dest - src) % 4096 is between 1 and 63, then short copies take 55 clocks,
jumping to 75 at 128 bytes and then increasing slowly.
(I think that matches what I've seen.)
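
That aliasing case is easy to reproduce by offsetting the two buffers within
a page - a hypothetical setup:

	#include <stdlib.h>

	unsigned delta = 16;		/* anything in 1..63 */
	unsigned char *base = aligned_alloc(4096, 16 * 4096);
	unsigned char *src = base;
	/* (dst - src) % 4096 == delta, the slow case above */
	unsigned char *dst = base + 4096 + delta;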

	David


