Date:   Sat, 29 Jan 2022 13:41:52 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     "'michael@...haelkloos.com'" <michael@...haelkloos.com>,
        Palmer Dabbelt <palmer@...belt.com>,
        Paul Walmsley <paul.walmsley@...ive.com>,
        Albert Ou <aou@...s.berkeley.edu>
CC:     "linux-riscv@...ts.infradead.org" <linux-riscv@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v4] riscv: Fixed misaligned memory access.  Fixed pointer
 comparison.

From: michael@...haelkloos.com
...
> [v4]
> 
> I could not resist implementing the optimization I mentioned in
> my v3 notes.  I have implemented the roll over of data by cpu
> register in the misaligned fixup copy loops.  Now, only one load
> from memory is required per iteration of the loop.

I nearly commented...

...
> +	/*
> +	 * Fix Misalignment Copy Loop.
> +	 * load_val1 = load_ptr[0];
> +	 * while (store_ptr != store_ptr_end) {
> +	 *   load_val0 = load_val1;
> +	 *   load_val1 = load_ptr[1];
> +	 *   *store_ptr = (load_val0 >> {a6}) | (load_val1 << {a7});
> +	 *   load_ptr++;
> +	 *   store_ptr++;
> +	 * }
> +	 */
> +	REG_L t0, 0x000(a3)
> +	1:
> +	beq   t3, t6, 2f
> +	mv    t1, t0
> +	REG_L t0, SZREG(a3)
> +	srl   t1, t1, a6
> +	sll   t2, t0, a7
> +	or    t1, t1, t2
> +	REG_S t1, 0x000(t3)
> +	addi  a3, a3, SZREG
> +	addi  t3, t3, SZREG
> +	j 1b

There is no point jumping to a conditional branch that jumps back.
Make this a:
	bne	t3, t6, 1b
and move 1: down one instruction.
(Or is the 'beq' at the top even possible - there is likely to
be an earlier test for zero length copies.)
> +	2:
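As a rough C-level sketch of the bottom-tested loop suggested above (not the patch itself): the function name, the 64-bit word size, and the little-endian shift direction are assumptions for illustration, and the zero-length check is hoisted out of the loop on the assumption that the caller has not already done it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * src points at the aligned word containing the first source byte and
 * must have n_words + 1 readable words. shift_bytes (1..7) is the
 * misalignment of the real source address.
 */
void copy_misaligned_bottom_test(uint64_t *dst, const uint64_t *src,
                                 size_t n_words, unsigned shift_bytes)
{
    assert(shift_bytes >= 1 && shift_bytes <= 7);

    unsigned rshift = 8 * shift_bytes;   /* plays the role of a6 */
    unsigned lshift = 64 - rshift;       /* plays the role of a7 */
    uint64_t *dst_end = dst + n_words;
    uint64_t cur;

    if (dst == dst_end)                  /* zero-length test, done once */
        return;

    cur = src[0];
    do {
        uint64_t prev = cur;             /* the 'mv t1, t0' */
        cur = src[1];                    /* single load per iteration */
        *dst = (prev >> rshift) | (cur << lshift);
        src++;
        dst++;
    } while (dst != dst_end);            /* one backward branch, no 'j' */
}
```

With the test at the bottom, each iteration executes one conditional branch instead of an unconditional jump followed by a conditional branch.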

I also suspect it is worth unrolling the loop once.
You lose the 'mv t1, t0' and one 'addi' for each word transferred.
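A minimal C sketch of what unrolling once could look like, under the same assumptions as the patch comment (64-bit words, little-endian shifts). The function name and the peeling of an odd leading word are illustrative choices, not something taken from the patch. The two registers swap roles in the two halves of the loop body, so no copy is needed inside the loop and each pointer is bumped once per two words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * src must have n_words + 1 readable words; shift_bytes is 1..7.
 */
void copy_misaligned_unrolled(uint64_t *dst, const uint64_t *src,
                              size_t n_words, unsigned shift_bytes)
{
    assert(shift_bytes >= 1 && shift_bytes <= 7);

    unsigned rshift = 8 * shift_bytes;
    unsigned lshift = 64 - rshift;
    uint64_t *dst_end = dst + n_words;
    uint64_t a, b;

    if (n_words == 0)
        return;

    a = src[0];

    /* Peel one word so the remaining count is even. */
    if (n_words & 1) {
        b = src[1];
        *dst++ = (a >> rshift) | (b << lshift);
        src++;
        a = b;
    }

    /* Two words per iteration; a and b alternate as "previous" word. */
    while (dst != dst_end) {
        b = src[1];
        dst[0] = (a >> rshift) | (b << lshift);
        a = src[2];
        dst[1] = (b >> rshift) | (a << lshift);
        src += 2;
        dst += 2;
    }
}
```

Whether this is a win in the real assembly depends on code size and how the surrounding copy code handles the odd word.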

I think someone mentioned that there is a delay of a few clocks before
the data from the memory read (REG_L) is actually available.
On an in-order CPU this is likely to be a full pipeline stall.
So move the 'addi' up between the 'REG_L' and 'sll' instructions.
(The offset will need to be -SZREG to match.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
