Message-ID: <eb71d765d409413887bab48cbd1fc014@AcuMS.aculab.com>
Date: Mon, 16 Sep 2019 14:18:58 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Alexey Dobriyan' <adobriyan@...il.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"bp@...en8.de" <bp@...en8.de>, "hpa@...or.com" <hpa@...or.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"x86@...r.kernel.org" <x86@...r.kernel.org>,
"linux@...musvillemoes.dk" <linux@...musvillemoes.dk>,
"torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>
Subject: RE: [PATCH] x86_64: new and improved memset()
From: Alexey Dobriyan
> Sent: 14 September 2019 11:34
...
> +ENTRY(memset0_rep_stosq)
> + xor eax, eax
> +.globl memsetx_rep_stosq
> +memsetx_rep_stosq:
> + lea rsi, [rdi + rcx]
> + shr rcx, 3
> + rep stosq
> + cmp rdi, rsi
> + je 1f
> +2:
> + mov [rdi], al
> + add rdi, 1
> + cmp rdi, rsi
> + jne 2b
> +1:
> + ret
You can handle the 'trailing bytes' first with a single, potentially misaligned
qword store: the last 8 bytes always overlap whatever 'rep stosq' covers, so
once the length is at least 8 nothing gets missed.
Something like (modulo asm syntax and argument ordering):
	lea	rsi, [rdi + rcx]	# rsi = one past the end of the buffer
	shr	rcx, 3			# rcx = number of whole qwords
	jrcxz	1f			# Short buffer (< 8 bytes)
	mov	[rsi - 8], rax		# Last 8 bytes, possibly misaligned
	rep stosq
	ret
1:
	mov	[rdi], al
	add	rdi, 1
	cmp	rdi, rsi
	jne	1b
	ret
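Put together in the patch's own style it might look like this (just a sketch,
assuming the patch's convention of rdi = dest and rcx = byte count; note the
extra check, since a zero length would otherwise fall into the byte loop and
store one byte):

ENTRY(memset0_rep_stosq)
	xor	eax, eax		# fill pattern: zero
	lea	rsi, [rdi + rcx]	# rsi = one past the end
	shr	rcx, 3			# rcx = whole qwords
	jrcxz	1f			# < 8 bytes: byte loop only
	mov	[rsi - 8], rax		# last 8 bytes, possibly misaligned,
					# overlapping what rep stosq covers
	rep stosq			# leading whole qwords
	ret
1:
	cmp	rdi, rsi
	je	2f			# zero length: nothing to do
3:
	mov	[rdi], al
	add	rdi, 1
	cmp	rdi, rsi
	jne	3b
2:
	ret
ENDPROC(memset0_rep_stosq)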
The final byte loop can be one instruction shorter if the pointer is biased:
keep the negated remaining count in rdi and index off a register (rxx below)
that points one past the end:
1:
	mov	[rdi + rxx], al		# rxx = one past the end, rdi = -(bytes left)
	add	rdi, 1
	jnz	1b			# the add sets ZF when rdi reaches zero
	ret
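Instantiated with a concrete register (my choice of rsi as the end pointer,
not anything from the patch), that tail might be set up like:

	# on entry: rdi = start of the 0..7 tail bytes,
	# rsi = one past the end, al = fill byte
	sub	rdi, rsi		# rdi = -(bytes left)
	jz	2f			# nothing left to store
1:
	mov	[rsi + rdi], al		# store at end - bytes_left
	add	rdi, 1			# sets ZF when the count hits zero
	jnz	1b
2:
	ret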
Last I looked, 'jcxz' (strictly 'jrcxz' in 64-bit code) was 'ok' on all recent
AMD and Intel CPUs.
OTOH 'loop' is horrid on the Intel ones.
The same applies to the other versions.
I suspect it isn't worth optimising to realign misaligned buffers; they are
unlikely to happen often enough to matter.
I also think that gcc's __builtin version already does some of the
short-buffer optimisations.
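For instance, a constant-size call such as memset(p, 0, 16) is normally
expanded inline to plain stores rather than a call; roughly something like
(a sketch of the usual expansion with p in rdi, not verified output from any
particular gcc):

	xor	eax, eax		# 8 zero bytes in rax
	mov	[rdi], rax		# first half
	mov	[rdi + 8], rax		# second half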
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)