Message-ID: <CAFULd4ZaVMgNi7fHC1JzyAiq9bNmXF9aB4Us8X_zU4nMLct=FQ@mail.gmail.com>
Date: Wed, 7 May 2025 23:29:04 +0200
From: Uros Bizjak <ubizjak@...il.com>
To: David Laight <david.laight.linux@...il.com>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, 
	Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...nel.org>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH -tip 3/3] x86/asm/32: Modernize _memcpy()

On Wed, May 7, 2025 at 10:28 PM David Laight
<david.laight.linux@...il.com> wrote:
>
> On Tue, 6 May 2025 19:34:34 +0200
> Uros Bizjak <ubizjak@...il.com> wrote:
>
> > On Tue, May 6, 2025 at 6:52 PM Uros Bizjak <ubizjak@...il.com> wrote:
> > >
> > > Use the inout "+" constraint modifier where appropriate, declare
> > > temporary variables as unsigned long, and rewrite parts of the
> > > assembly in plain C. The memcpy() function shrinks by 10 bytes, from:
> > >
> > > 00e778d0 <memcpy>:
> > >   e778d0:       55                      push   %ebp
> > >   e778d1:       89 e5                   mov    %esp,%ebp
> > >   e778d3:       83 ec 0c                sub    $0xc,%esp
> > >   e778d6:       89 5d f4                mov    %ebx,-0xc(%ebp)
> > >   e778d9:       89 c3                   mov    %eax,%ebx
> > >   e778db:       89 c8                   mov    %ecx,%eax
> > >   e778dd:       89 75 f8                mov    %esi,-0x8(%ebp)
> > >   e778e0:       c1 e9 02                shr    $0x2,%ecx
> > >   e778e3:       89 d6                   mov    %edx,%esi
> > >   e778e5:       89 7d fc                mov    %edi,-0x4(%ebp)
> > >   e778e8:       89 df                   mov    %ebx,%edi
> > >   e778ea:       f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
> > >   e778ec:       89 c1                   mov    %eax,%ecx
> > >   e778ee:       83 e1 03                and    $0x3,%ecx
> > >   e778f1:       74 02                   je     e778f5 <memcpy+0x25>
> > >   e778f3:       f3 a4                   rep movsb %ds:(%esi),%es:(%edi)
>
> Hmmm....
> IIRC you really don't want to be doing a [1..3] byte 'rep movsb' there.
> Some CPUs will run it quickly - but most of those would do the whole
> copy with 'rep movsb' faster.
>
> It would also be interesting to try to measure the cost of the 'je'
> being mispredicted.
> I bet a beer or two that at least one CPU can't abort the setup cost
> of the 'rep movsb' - so you take the full hit.

The intention of the patch was to keep the existing functionality while
modernizing the assembly to use inout "+" constraint modifiers. In
passing, some parts of the assembly were converted to plain C. At least
to me, the source code is much easier to read this way, and the
conversion also enables some compiler optimizations that result in
better assembly. Functional improvements could be implemented as
follow-up patches.
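
For illustration, a minimal sketch of the style the patch moves towards
(not the actual patch; the function name and the split into two asm
statements are purely illustrative):

#include <stddef.h>

/*
 * Sketch only: x86-32 memcpy using "+" (input/output) constraints,
 * with the word/remainder split computed in plain C so the compiler
 * can optimize around it.
 */
static void *memcpy_sketch(void *dest, const void *src, size_t n)
{
	unsigned long d = (unsigned long)dest;
	unsigned long s = (unsigned long)src;
	unsigned long words = n >> 2;
	unsigned long rem = n & 3;

	asm volatile("rep movsl"
		     : "+D" (d), "+S" (s), "+c" (words)
		     : : "memory");
	if (rem)
		asm volatile("rep movsb"
			     : "+D" (d), "+S" (s), "+c" (rem)
			     : : "memory");
	return dest;
}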

Uros.

> I do need to rerun my 'rep movsb' performance measurements using data
> dependencies (not lfence/mfence) to synchronise things.
> The 'before' data dependency is easy: count += (clocks & volatile_zero).
> The 'after' one can be done the same way if using the performance counters.
> It is probably enough to use the updated value of %si or %di rather than
> doing a read-back of the last memory write.
>
> I've done that for a different function and can see how the clock count
> for divide depends on its arguments.
>
>         David
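
For reference, a rough userspace sketch of that data-dependency timing
idea (purely illustrative: volatile_zero, the read_pmc() wrapper and the
counter index are assumptions, and rdpmc only works from user space when
CR4.PCE is enabled):

#include <stddef.h>
#include <stdint.h>

static volatile unsigned long volatile_zero;	/* always 0, but opaque to the compiler */

static inline uint64_t read_pmc(unsigned int idx)
{
	unsigned int lo, hi;

	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (idx));
	return ((uint64_t)hi << 32) | lo;
}

static uint64_t time_rep_movsb(void *dst, const void *src, size_t len,
			       unsigned int pmc)
{
	unsigned long d = (unsigned long)dst;
	unsigned long s = (unsigned long)src;
	uint64_t before, after;

	before = read_pmc(pmc);
	/* 'before' dependency: the byte count depends on the first read */
	len += before & volatile_zero;

	asm volatile("rep movsb"
		     : "+D" (d), "+S" (s), "+c" (len)
		     : : "memory");

	/* 'after' dependency: the counter index depends on the updated %di */
	after = read_pmc(pmc + (unsigned int)(d & volatile_zero));

	return after - before;
}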
