[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250507212807.568551f2@pumpkin>
Date: Wed, 7 May 2025 21:28:07 +0100
From: David Laight <david.laight.linux@...il.com>
To: Uros Bizjak <ubizjak@...il.com>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, Thomas Gleixner
<tglx@...utronix.de>, Ingo Molnar <mingo@...nel.org>, Borislav Petkov
<bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin"
<hpa@...or.com>
Subject: Re: [PATCH -tip 3/3] x86/asm/32: Modernize _memcpy()
On Tue, 6 May 2025 19:34:34 +0200
Uros Bizjak <ubizjak@...il.com> wrote:
> On Tue, May 6, 2025 at 6:52 PM Uros Bizjak <ubizjak@...il.com> wrote:
> >
> > Use inout "+" constraint modifier where appropriate, declare
> > temporary variables as unsigned long and rewrite parts of assembly
> > in plain C. The memcpy() function shrinks by 10 bytes, from:
> >
> > 00e778d0 <memcpy>:
> > e778d0: 55 push %ebp
> > e778d1: 89 e5 mov %esp,%ebp
> > e778d3: 83 ec 0c sub $0xc,%esp
> > e778d6: 89 5d f4 mov %ebx,-0xc(%ebp)
> > e778d9: 89 c3 mov %eax,%ebx
> > e778db: 89 c8 mov %ecx,%eax
> > e778dd: 89 75 f8 mov %esi,-0x8(%ebp)
> > e778e0: c1 e9 02 shr $0x2,%ecx
> > e778e3: 89 d6 mov %edx,%esi
> > e778e5: 89 7d fc mov %edi,-0x4(%ebp)
> > e778e8: 89 df mov %ebx,%edi
> > e778ea: f3 a5 rep movsl %ds:(%esi),%es:(%edi)
> > e778ec: 89 c1 mov %eax,%ecx
> > e778ee: 83 e1 03 and $0x3,%ecx
> > e778f1: 74 02 je e778f5 <memcpy+0x25>
> > e778f3: f3 a4 rep movsb %ds:(%esi),%es:(%edi)
Hmmm....
IIRC you really don't want to be doing a [1..3] byte 'rep movsb' there.
Some cpu will run it quickly - but most of those will do a 'rep movsb' faster.
It would also be interesting to try to measure the cost of the 'je'
being mispredicted.
I bet a beer or two that at least one cpu can't abort the setup cost
of the 'rep movsb' - so you take the full hit.
I do need to rerun my 'rep movsb' performance measurements using data
dependencies (not lfence/mfence) to synchronise things.
The 'before' data dependency is easy: count += (clocks & volatile_zero).
The 'after' one can be done the same way if using the performance counters.
It is probably enough to use the updated value of %si or %di rather than
doing a read-back of the last memory write.
I've done that for a different function and can see how the clock count
for divide depends on its arguments.
David
Powered by blists - more mailing lists