Message-ID: <eb71d765d409413887bab48cbd1fc014@AcuMS.aculab.com>
Date: Mon, 16 Sep 2019 14:18:58 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Alexey Dobriyan' <adobriyan@...il.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"bp@...en8.de" <bp@...en8.de>, "hpa@...or.com" <hpa@...or.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"x86@...r.kernel.org" <x86@...r.kernel.org>,
"linux@...musvillemoes.dk" <linux@...musvillemoes.dk>,
"torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>
Subject: RE: [PATCH] x86_64: new and improved memset()
From: Alexey Dobriyan
> Sent: 14 September 2019 11:34
...
> +ENTRY(memset0_rep_stosq)
> + xor eax, eax
> +.globl memsetx_rep_stosq
> +memsetx_rep_stosq:
> + lea rsi, [rdi + rcx]
> + shr rcx, 3
> + rep stosq
> + cmp rdi, rsi
> + je 1f
> +2:
> + mov [rdi], al
> + add rdi, 1
> + cmp rdi, rsi
> + jne 2b
> +1:
> + ret
You can handle the 'trailing bytes' first with a single, potentially misaligned
qword store: the last 8 bytes always overlap whatever 'rep stosq' covers, so
once the length is at least 8 nothing gets missed.
Something like (modulo asm syntax and argument ordering):
	lea	rsi, [rdi + rcx]	# rsi = one past the end of the buffer
	shr	rcx, 3			# rcx = number of whole qwords
	jrcxz	1f			# Short buffer (< 8 bytes)
	mov	[rsi - 8], rax		# Last 8 bytes, possibly misaligned
	rep stosq
	ret
1:
	mov	[rdi], al
	add	rdi, 1
	cmp	rdi, rsi
	jne	1b
	ret
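Put together in the patch's own style it might look like this (just a sketch,
assuming the patch's convention of rdi = dest and rcx = byte count; note the
extra check, since a zero length would otherwise fall into the byte loop and
store one byte):

ENTRY(memset0_rep_stosq)
	xor	eax, eax		# fill pattern: zero
	lea	rsi, [rdi + rcx]	# rsi = one past the end
	shr	rcx, 3			# rcx = whole qwords
	jrcxz	1f			# < 8 bytes: byte loop only
	mov	[rsi - 8], rax		# last 8 bytes, possibly misaligned,
					# overlapping what rep stosq covers
	rep stosq			# leading whole qwords
	ret
1:
	cmp	rdi, rsi
	je	2f			# zero length: nothing to do
3:
	mov	[rdi], al
	add	rdi, 1
	cmp	rdi, rsi
	jne	3b
2:
	ret
ENDPROC(memset0_rep_stosq)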
The final byte loop can be one instruction shorter if the pointer is biased:
keep the negated remaining count in rdi and index off a register (rxx below)
that points one past the end:
1:
	mov	[rdi + rxx], al		# rxx = one past the end, rdi = -(bytes left)
	add	rdi, 1
	jnz	1b			# the add sets ZF when rdi reaches zero
	ret
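Instantiated with a concrete register (my choice of rsi as the end pointer,
not anything from the patch), that tail might be set up like:

	# on entry: rdi = start of the 0..7 tail bytes,
	# rsi = one past the end, al = fill byte
	sub	rdi, rsi		# rdi = -(bytes left)
	jz	2f			# nothing left to store
1:
	mov	[rsi + rdi], al		# store at end - bytes_left
	add	rdi, 1			# sets ZF when the count hits zero
	jnz	1b
2:
	ret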
Last I looked, 'jcxz' (strictly 'jrcxz' in 64-bit code) was 'ok' on all recent
AMD and Intel CPUs.
OTOH 'loop' is horrid on the Intel ones.
The same applies to the other versions.
I suspect it isn't worth optimising to realign misaligned buffers; they are
unlikely to happen often enough to matter.
I also think that gcc's __builtin version already does some of the
short-buffer optimisations.
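For instance, a constant-size call such as memset(p, 0, 16) is normally
expanded inline to plain stores rather than a call; roughly something like
(a sketch of the usual expansion with p in rdi, not verified output from any
particular gcc):

	xor	eax, eax		# 8 zero bytes in rax
	mov	[rdi], rax		# first half
	mov	[rdi + 8], rax		# second half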
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)