linux-kernel - Re: [PATCH -tip 3/3] x86/asm/32: Modernize

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250507212807.568551f2@pumpkin>
Date: Wed, 7 May 2025 21:28:07 +0100
From: David Laight <david.laight.linux@...il.com>
To: Uros Bizjak <ubizjak@...il.com>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, Thomas Gleixner
 <tglx@...utronix.de>, Ingo Molnar <mingo@...nel.org>, Borislav Petkov
 <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin"
 <hpa@...or.com>
Subject: Re: [PATCH -tip 3/3] x86/asm/32: Modernize _memcpy()

On Tue, 6 May 2025 19:34:34 +0200
Uros Bizjak <ubizjak@...il.com> wrote:

> On Tue, May 6, 2025 at 6:52 PM Uros Bizjak <ubizjak@...il.com> wrote:
> >
> > Use inout "+" constraint modifier where appropriate, declare
> > temporary variables as unsigned long and rewrite parts of assembly
> > in plain C. The memcpy() function shrinks by 10 bytes, from:
> >
> > 00e778d0 <memcpy>:
> >   e778d0:       55                      push   %ebp
> >   e778d1:       89 e5                   mov    %esp,%ebp
> >   e778d3:       83 ec 0c                sub    $0xc,%esp
> >   e778d6:       89 5d f4                mov    %ebx,-0xc(%ebp)
> >   e778d9:       89 c3                   mov    %eax,%ebx
> >   e778db:       89 c8                   mov    %ecx,%eax
> >   e778dd:       89 75 f8                mov    %esi,-0x8(%ebp)
> >   e778e0:       c1 e9 02                shr    $0x2,%ecx
> >   e778e3:       89 d6                   mov    %edx,%esi
> >   e778e5:       89 7d fc                mov    %edi,-0x4(%ebp)
> >   e778e8:       89 df                   mov    %ebx,%edi
> >   e778ea:       f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
> >   e778ec:       89 c1                   mov    %eax,%ecx
> >   e778ee:       83 e1 03                and    $0x3,%ecx
> >   e778f1:       74 02                   je     e778f5 <memcpy+0x25>
> >   e778f3:       f3 a4                   rep movsb %ds:(%esi),%es:(%edi)

Hmmm....
IIRC you really don't want to be doing a [1..3] byte 'rep movsb' there.
Some cpu will run it quickly - but most of those will do a 'rep movsb' faster.

It would also be interesting to try to measure the cost of the 'je'
being mispredicted.
I bet a beer or two that at least one cpu can't abort the setup cost
of the 'rep movsb' - so you take the full hit.

I do need to rerun my 'rep movsb' performance measurements using data
dependencies (not lfence/mfence) to synchronise things.
The 'before' data dependency is easy: count += (clocks & volatile_zero).
The 'after' one can be done the same way if using the performance counters.
It is probably enough to use the updated value of %si or %di rather than
doing a read-back of the last memory write.

I've done that for a different function and can see how the clock count
for divide depends on its arguments.

	David