Date:   Wed, 24 Mar 2021 19:36:56 +0000
From:   Robin Murphy <robin.murphy@....com>
To:     David Laight <David.Laight@...LAB.COM>,
        Yang Yingliang <yangyingliang@...wei.com>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Cc:     "catalin.marinas@....com" <catalin.marinas@....com>,
        "will@...nel.org" <will@...nel.org>,
        "guohanjun@...wei.com" <guohanjun@...wei.com>
Subject: Re: [PATCH 2/3] arm64: lib: improve copy performance when size is ge 128 bytes

On 2021-03-24 16:38, David Laight wrote:
> From: Robin Murphy
>> Sent: 23 March 2021 12:09
>>
>> On 2021-03-23 07:34, Yang Yingliang wrote:
>>> When copying more than 128 bytes, src/dst are incremented after
>>> each ldp/stp instruction, which costs extra time. To improve
>>> this, increment src/dst only after every 64 bytes loaded or
>>> stored.
>>
>> This breaks the required behaviour for copy_*_user(), since the fault
>> handler expects the base address to be up-to-date at all times. Say
>> you're copying 128 bytes and fault on the 4th store: it should return
>> 80 bytes not copied; the code below would return 128 bytes not copied,
>> even though 48 bytes have actually been written to the destination.
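
For illustration, a minimal C model of that accounting (a sketch only,
not the kernel's actual extable fixup; it assumes 16-byte stp stores and
a fault handler that reports end - dst, which is why the base register
must be kept current):

	#include <stdio.h>

	int main(void)
	{
		unsigned long dst = 0, end = 128;

		/* per-access increment: dst is bumped after every 16-byte
		 * store, so three stores complete before the 4th faults */
		dst += 3 * 16;
		printf("up-to-date base:    %lu not copied\n", end - dst);

		/* deferred increment (the patch): dst is only bumped after
		 * a full 64-byte block, so at the fault it is still 0 */
		printf("deferred increment: %lu not copied\n", end - 0UL);
		return 0;
	}

The first line prints the expected 80; the second prints 128, the
over-count described above.
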
> 
> Are there any non-superscalar arm64 CPUs (that anyone cares about)?
> 
> If the CPU can execute multiple instructions in one clock
> then the loop control usually comes (almost) for free.
> 
> You might need to unroll once to interleave reads and writes,
> but unrolling further may be pointless.

Nah, the whole point is that using post-increment addressing is crap in 
the first place: it introduces register dependencies between successive 
accesses that offset addressing would avoid entirely. It's especially 
crap when we don't even *have* a post-index addressing mode for the 
unprivileged load/store instructions used in copy_*_user(), and have to 
simulate it with extra instructions that throw off the code alignment.
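
For illustration, the dependency difference sketched in C (each pair of
64-bit copies stands in for one 16-byte ldp/stp; the function names are
made up for this sketch, not the actual routines):

	#include <assert.h>
	#include <stdint.h>
	#include <string.h>

	/* post-index style: every address depends on the preceding
	 * pointer update, serialising address generation */
	static void block_postindex(uint64_t **s, uint64_t **d)
	{
		for (int i = 0; i < 4; i++) {
			(*d)[0] = (*s)[0];
			(*d)[1] = (*s)[1];
			*s += 2;
			*d += 2;
		}
	}

	/* offset style: all addresses derive from one base, so the
	 * accesses can issue independently, with a single pointer
	 * update per 64-byte block */
	static void block_offset(uint64_t **s, uint64_t **d)
	{
		uint64_t *src = *s, *dst = *d;

		dst[0] = src[0];  dst[1] = src[1];
		dst[2] = src[2];  dst[3] = src[3];
		dst[4] = src[4];  dst[5] = src[5];
		dst[6] = src[6];  dst[7] = src[7];
		*s = src + 8;
		*d = dst + 8;
	}

	int main(void)
	{
		uint64_t in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
		uint64_t *s = in, *d = out;

		block_offset(&s, &d);	/* or block_postindex(&s, &d) */
		assert(memcmp(in, out, sizeof(in)) == 0);
		return 0;
	}
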

We already have code that's tuned to work well across our 
microarchitectures[1]; the issue is that butchering it to satisfy the 
additional requirements of copy_*_user() with a common template has 
hobbled regular memcpy() performance. I intend to have a crack at fixing 
that properly tomorrow ;)

Robin.

[1] https://github.com/ARM-software/optimized-routines

> So something like:
> 	a = *src++;
> 	do {
> 		b = *src++;
> 		*dst++ = a;
> 		a = *src++;
> 		*dst++ = b;
> 	} while (src != lim);
> 	*dst = a;	/* store the last value loaded */
> 
>      David
> 
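
For illustration, David's sketch above fleshed out into compilable C;
copy_pipelined() is a made-up name, and the odd-count restriction is an
assumption of the sketch (a real routine would peel the remainder
first):

	#include <assert.h>
	#include <stddef.h>
	#include <string.h>

	/* software-pipelined word copy after the sketch above;
	 * assumes n is odd and n >= 3 so that src lands exactly
	 * on lim at the loop exit test */
	static void copy_pipelined(long *dst, const long *src, size_t n)
	{
		const long *lim = src + n;
		long a, b;

		assert((n & 1) && n >= 3);
		a = *src++;
		do {
			b = *src++;
			*dst++ = a;
			a = *src++;
			*dst++ = b;
		} while (src != lim);
		*dst = a;	/* flush the last value loaded */
	}

	int main(void)
	{
		long in[7] = {1, 2, 3, 4, 5, 6, 7}, out[7];

		copy_pipelined(out, in, 7);
		assert(memcmp(in, out, sizeof(in)) == 0);
		return 0;
	}
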
