Message-ID: <a210197c479e48778672aa13287eef88@AcuMS.aculab.com>
Date: Sun, 17 Dec 2023 18:10:54 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Ivan Orlov' <ivan.orlov0322@...il.com>, "paul.walmsley@...ive.com"
<paul.walmsley@...ive.com>, "palmer@...belt.com" <palmer@...belt.com>,
"aou@...s.berkeley.edu" <aou@...s.berkeley.edu>
CC: "conor.dooley@...rochip.com" <conor.dooley@...rochip.com>,
"ajones@...tanamicro.com" <ajones@...tanamicro.com>, "samuel@...lland.org"
<samuel@...lland.org>, "alexghiti@...osinc.com" <alexghiti@...osinc.com>,
"linux-riscv@...ts.infradead.org" <linux-riscv@...ts.infradead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"skhan@...uxfoundation.org" <skhan@...uxfoundation.org>
Subject: RE: [PATCH] riscv: lib: Optimize 'strlen' function
From: Ivan Orlov
> Sent: 13 December 2023 15:46
Looking at the old code...
> 1:
> - lbu t0, 0(t1)
> - beqz t0, 2f
> - addi t1, t1, 1
> - j 1b
I suspect there is (at least) a two-clock stall between
the 'lbu' and the 'beqz'.
With four instructions, and allowing one clock for the
'predicted taken' branch, that is 7 clocks/byte.
Try this one - especially on 32bit:
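   mv   t0, a0
   andi t1, t0, 1     # t1 = 1 if the pointer is odd
   sub  t0, t0, t1    # round t0 down to an even address
   bnez t1, 2f        # odd: skip the first load, t1 is non-zero
1:
   lbu  t1, 0(t0)
2: lbu  t2, 1(t0)
   addi t0, t0, 2
   beqz t1, 3f
   bnez t2, 1b
   addi t0, t0, 1     # the NUL was the second byte of the pair
3: addi t0, t0, -2    # undo the increment past the pair
   sub  a0, t0, a0
   ret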
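Might be 6 clocks for 2 bytes: five instructions plus one clock for
the taken 'bnez', with two instructions between each load and the
branch that uses it.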
The much smaller cache footprint will also help.
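For anyone wanting to check the pointer arithmetic, the loop above can
be modelled in C. This is only a sketch: strlen_by_pairs() and
pairs_len_at() are made-up names, not anything in the kernel. Like the
assembly, the model loads both bytes of an aligned pair before testing
either, so the helper places the string in an aligned, zero-padded
buffer to keep every read in bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the two-bytes-per-iteration loop. */
static size_t strlen_by_pairs(const char *s)
{
    uintptr_t odd = (uintptr_t)s & 1;          /* t1 = a0 & 1 */
    const unsigned char *p =
        (const unsigned char *)s - odd;        /* round down to even */
    unsigned char b0 = 1;  /* non-zero stand-in for t1 on odd entry */
    unsigned char b1;

    if (odd)
        goto second;                           /* bnez t1, 2f */
    for (;;) {
        b0 = p[0];                             /* 1: first byte of pair */
second:
        b1 = p[1];                             /* 2: second byte of pair */
        p += 2;
        if (b0 == 0)                           /* beqz t1, 3f */
            break;
        if (b1 == 0) {                         /* bnez t2, 1b not taken */
            p += 1;
            break;
        }
    }
    p -= 2;                                    /* 3: undo the increment */
    return (size_t)(p - (const unsigned char *)s);
}

/*
 * Hypothetical helper: copy 'str' to buf + offset inside a 2-byte
 * aligned, zero-filled buffer and measure it with the model.
 */
static size_t pairs_len_at(const char *str, size_t offset)
{
    _Alignas(2) char buf[64] = {0};

    assert(offset + strlen(str) + 2 <= sizeof(buf));
    memcpy(buf + offset, str, strlen(str));
    return strlen_by_pairs(buf + offset);
}
```

Calling pairs_len_at("hello", 1) exercises the odd-address entry at
label 2 and should agree with strlen(). The zero byte sitting just
before the string also checks that the skipped first load really is
skipped.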
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)