Message-ID: <CAFULd4ZaVMgNi7fHC1JzyAiq9bNmXF9aB4Us8X_zU4nMLct=FQ@mail.gmail.com>
Date: Wed, 7 May 2025 23:29:04 +0200
From: Uros Bizjak <ubizjak@...il.com>
To: David Laight <david.laight.linux@...il.com>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org,
Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...nel.org>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH -tip 3/3] x86/asm/32: Modernize _memcpy()
On Wed, May 7, 2025 at 10:28 PM David Laight
<david.laight.linux@...il.com> wrote:
>
> On Tue, 6 May 2025 19:34:34 +0200
> Uros Bizjak <ubizjak@...il.com> wrote:
>
> > On Tue, May 6, 2025 at 6:52 PM Uros Bizjak <ubizjak@...il.com> wrote:
> > >
> > > Use inout "+" constraint modifier where appropriate, declare
> > > temporary variables as unsigned long and rewrite parts of assembly
> > > in plain C. The memcpy() function shrinks by 10 bytes, from:
> > >
> > > 00e778d0 <memcpy>:
> > > e778d0: 55 push %ebp
> > > e778d1: 89 e5 mov %esp,%ebp
> > > e778d3: 83 ec 0c sub $0xc,%esp
> > > e778d6: 89 5d f4 mov %ebx,-0xc(%ebp)
> > > e778d9: 89 c3 mov %eax,%ebx
> > > e778db: 89 c8 mov %ecx,%eax
> > > e778dd: 89 75 f8 mov %esi,-0x8(%ebp)
> > > e778e0: c1 e9 02 shr $0x2,%ecx
> > > e778e3: 89 d6 mov %edx,%esi
> > > e778e5: 89 7d fc mov %edi,-0x4(%ebp)
> > > e778e8: 89 df mov %ebx,%edi
> > > e778ea: f3 a5 rep movsl %ds:(%esi),%es:(%edi)
> > > e778ec: 89 c1 mov %eax,%ecx
> > > e778ee: 83 e1 03 and $0x3,%ecx
> > > e778f1: 74 02 je e778f5 <memcpy+0x25>
> > > e778f3: f3 a4 rep movsb %ds:(%esi),%es:(%edi)
>
> Hmmm....
> IIRC you really don't want to be doing a [1..3] byte 'rep movsb' there.
> Some cpus will run it quickly - but most of those would do the whole
> copy faster with a single 'rep movsb'.
>
> It would also be interesting to try to measure the cost of the 'je'
> being mispredicted.
> I bet a beer or two that at least one cpu can't abort the setup cost
> of the 'rep movsb' - so you take the full hit.
The intention of the patch was to keep the existing functionality and
modernize/rewrite the assembly to use inout constraint modifiers. In
passing, some parts of the assembly were converted to plain C. At
least to me, the source code is much easier to read this way, and it
also enables some compiler optimizations that result in better assembly.
Perhaps functional improvements can be implemented as follow-up
patches.
Uros.
> I do need to rerun my 'rep movsb' performance measurements using data
> dependencies (not lfence/mfence) to synchronise things.
> The 'before' data dependency is easy: count += (clocks & volatile_zero).
> The 'after' one can be done the same way if using the performance counters.
> It is probably enough to use the updated value of %si or %di rather than
> doing a read-back of the last memory write.
>
> I've done that for a different function and can see how the clock count
> for divide depends on its arguments.
>
> David