Date:   Fri, 12 May 2023 11:04:26 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'zhangfei' <zhang_fei_0403@....com>,
        "ajones@...tanamicro.com" <ajones@...tanamicro.com>
CC:     "aou@...s.berkeley.edu" <aou@...s.berkeley.edu>,
        "conor.dooley@...rochip.com" <conor.dooley@...rochip.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-riscv@...ts.infradead.org" <linux-riscv@...ts.infradead.org>,
        "palmer@...belt.com" <palmer@...belt.com>,
        "paul.walmsley@...ive.com" <paul.walmsley@...ive.com>,
        "zhangfei@...iscas.ac.cn" <zhangfei@...iscas.ac.cn>
Subject: RE: [PATCH v2 2/2] RISC-V: lib: Optimize memset performance

From: zhangfei
> Sent: 12 May 2023 09:51
...
> 2.Storing parallelism and reducing jumps will compensate for the cost of redundant
> stores. Based on the current multiple test results, regardless of which bytes I
> modify to check, its performance is better than byte by byte storage.
> 
> 3.From the above experiment, for the detection of 2, 6, 8, 11, and 14, its overall
> performance is the best.

I'm surprised the RISC-V cpu supports parallel stores.
A typical x86 desktop cpu can only do a single store (and two loads) every clock.
Clearly doing writes offset from both ends of the buffer does
reduce the number of control instructions relative to the stores.
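
As an illustration of writing from both ends, here is a minimal C sketch for
small fills. The names store64 and fill_8_to_32 are hypothetical and not taken
from the patch, and it assumes the target handles misaligned 8-byte stores
acceptably (memcpy is only used to keep the C well defined):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: one 8-byte store to a possibly misaligned address. */
static void store64(unsigned char *p, uint64_t v)
{
	memcpy(p, &v, sizeof v);
}

/*
 * Fill n bytes (8 <= n <= 32) with byte c.
 * Stores are issued from both ends and may overlap in the middle,
 * so a single length test replaces a byte-by-byte loop.
 */
void fill_8_to_32(unsigned char *s, int c, size_t n)
{
	uint64_t v = 0x0101010101010101ULL * (unsigned char)c;
	unsigned char *e = s + n;

	assert(n >= 8 && n <= 32);

	store64(s, v);			/* first 8 bytes */
	store64(e - 8, v);		/* last 8 bytes, may overlap the first */
	if (n > 16) {
		store64(s + 8, v);	/* second 8 bytes */
		store64(e - 16, v);	/* 8 bytes before the tail store */
	}
}
```

For n between 17 and 32 this is four stores and one branch, with some bytes
written twice.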

Since memory writes can easily be queued I'd expect that your
'aim' would be one write every clock.
Achieving that requires knowledge of which instructions can execute in
parallel and the delays associated with correctly predicted branches.
That will very much depend on which RISC-V cpu you have.
Since any loop is at least two instructions (addi+blt) you almost
certainly need at least two writes per iteration.

I do think you are missing a trick though.
IIRC some RISC-V cpus properly support misaligned writes.
In that case, for long enough memset you can do something like:
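	end = start + length;
	*(u64 *)start = 0;		/* head store, may be misaligned */
	/* Round so that (start - 16) is 8-byte aligned and at most 8 bytes
	 * past the old start; rounding to 16 could land before the buffer. */
	start = (start + 24) & ~7;
	do {
		*(u64 *)(start - 16) = 0;
		*(u64 *)(start - 8) = 0;
		start += 16;
	} while (start < end);
	*(u64 *)(end - 16) = 0;		/* tail stores, may be misaligned */
	*(u64 *)(end - 8) = 0;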
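
For reference, the same idea as compilable C. This is only a sketch under
assumptions: memset_long and store64 are made-up names, the length must be at
least 16, and memcpy stands in for a raw (possibly misaligned) 8-byte store.
It rounds up to an 8-byte boundary so the single head store always covers the
gap before the first aligned store:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: one 8-byte store to a possibly misaligned address. */
static void store64(unsigned char *p, uint64_t v)
{
	memcpy(p, &v, sizeof v);
}

/* Sketch of a memset for len >= 16 using overlapping head and tail stores. */
void *memset_long(void *dst, int c, size_t len)
{
	unsigned char *start = dst;
	unsigned char *end = start + len;
	uint64_t v = 0x0101010101010101ULL * (unsigned char)c;
	unsigned char *p;

	assert(len >= 16);

	store64(start, v);	/* head store, may be misaligned */

	/* Next 8-byte boundary strictly after start, at most start + 8. */
	p = start + (8 - ((uintptr_t)start & 7));

	/* Aligned body: two stores per iteration, never past end. */
	while (end - p >= 16) {
		store64(p, v);
		store64(p + 8, v);
		p += 16;
	}

	/* Fewer than 16 bytes remain; cover them with overlapping tail stores. */
	store64(end - 16, v);
	store64(end - 8, v);
	return dst;
}
```

The body loop has two stores for each add and branch, and the head and tail
need no length-dependent branching at all.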

> Because I am not a chip designer, I find it difficult to answer specific energy
> consumption costs. Do you have any suggestions and how to conduct testing in this
> regard? I think although storage has increased, there has been a corresponding
> reduction in jumps and the use of pipelines.

Energy use will pretty much depend on the number of clocks.
Anything else will be 2nd order noise.

What does make a difference is that increasing the code size
evicts other code from the I-cache.
This has a knock-on effect on overall system performance.
So while massively unrolling a loop will improve a benchmark
(especially if it is run 'hot-cache') there can be negative
effects on overall system performance.
The code size here probably won't have a measurable effect but
unroll to many kb and the effect can get pronounced.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
