Message-ID: <f495519c-c681-4445-aedc-0f44fbd192e8@gmail.com>
Date: Sun, 17 Dec 2023 22:52:47 +0000
From: Ivan Orlov <ivan.orlov0322@...il.com>
To: David Laight <David.Laight@...LAB.COM>,
"paul.walmsley@...ive.com" <paul.walmsley@...ive.com>,
"palmer@...belt.com" <palmer@...belt.com>,
"aou@...s.berkeley.edu" <aou@...s.berkeley.edu>
Cc: "conor.dooley@...rochip.com" <conor.dooley@...rochip.com>,
"ajones@...tanamicro.com" <ajones@...tanamicro.com>,
"samuel@...lland.org" <samuel@...lland.org>,
"alexghiti@...osinc.com" <alexghiti@...osinc.com>,
"linux-riscv@...ts.infradead.org" <linux-riscv@...ts.infradead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"skhan@...uxfoundation.org" <skhan@...uxfoundation.org>
Subject: Re: [PATCH] riscv: lib: Optimize 'strlen' function
On 12/17/23 17:00, David Laight wrote:
> From: Ivan Orlov
>> Sent: 13 December 2023 15:46
>>
>> The current non-ZBB implementation of 'strlen' function iterates the
>> memory bytewise, looking for a zero byte. It could be optimized to use
>> the wordwise iteration instead, so we will process 4/8 bytes of memory
>> at a time.
> ...
>> 1. If the address is unaligned, iterate SZREG - (address % SZREG) bytes
>> to align it.
>
> An alternative is to mask the address and 'or' in non-zero bytes
> into the first word - might be faster.
>
Hi David,
Yeah, it might be an option, I'll test it. Thanks!
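
For illustration, David's suggestion could look roughly like the following in
portable C. This is only a sketch, not the assembly from the patch: the names
strlen_masked() and load_word() are hypothetical, the code assumes a
little-endian machine, and it relies on an aligned word read never crossing a
page boundary (strictly speaking, reading bytes before the start of the string
is outside what ISO C guarantees).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_SIZE sizeof(unsigned long)
#define ONES  (~0UL / 0xff)   /* 0x01 in every byte */
#define HIGHS (ONES << 7)     /* 0x80 in every byte */

/* Load one machine word from p without violating aliasing rules. */
static unsigned long load_word(const unsigned char *p)
{
	unsigned long w;

	memcpy(&w, p, WORD_SIZE);
	return w;
}

/* Hypothetical word-at-a-time strlen; little-endian only. */
static size_t strlen_masked(const char *s)
{
	size_t offset = (uintptr_t)s % WORD_SIZE;
	/* Round the address down to the enclosing aligned word. */
	const unsigned char *p = (const unsigned char *)s - offset;
	unsigned long word = load_word(p);
	size_t i;

	/*
	 * 'Or' 0xff into the bytes that precede s, so they can never
	 * be mistaken for the terminating NUL. On little-endian these
	 * are the low-order bytes. For offset == 0 the mask is 0.
	 */
	word |= (1UL << (8 * offset)) - 1;

	/* Non-zero exactly when at least one byte of word is zero. */
	while (!((word - ONES) & ~word & HIGHS)) {
		p += WORD_SIZE;
		word = load_word(p);
	}

	/* Locate the first zero byte inside the (possibly masked) word. */
	for (i = 0; (word >> (8 * i)) & 0xff; i++)
		;

	return (size_t)((p + i) - (const unsigned char *)s);
}
```

Compared with iterating SZREG - (address % SZREG) bytes first, this removes the
byte loop from the unaligned prologue at the cost of one shift and one 'or';
whether that is actually faster is what would need measuring.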
> ...
>> Here you can find the benchmarking results for the VisionFive2 board
>> comparing the old and new implementations of the strlen function.
>>
>> Size: 1 (+-0), mean_old: 673, mean_new: 666
>> Size: 2 (+-0), mean_old: 672, mean_new: 676
>> Size: 4 (+-0), mean_old: 685, mean_new: 659
>> Size: 8 (+-0), mean_old: 682, mean_new: 673
>> Size: 16 (+-0), mean_old: 718, mean_new: 694
> ...
>
> Is that 32bit or 64bit?
> The word-at-a-time strlen() is typically not worth it for 32bit.
>
I tested it only on a 64-bit board, as it is the only board I have...
I assume the performance gain would be less noticeable on 32-bit, and
the word-oriented function could probably be even slower than the
byte-oriented one for shorter strings.
However, I'm not sure if any physical 32-bit RISC-V boards with Linux
support actually exist at the moment... So the only way to test the
solution on a 32-bit system would be QEMU, and it probably wouldn't be
really representative, right?
But it's definitely worth a try, and I could probably include a separate
implementation for 32-bit RISC-V which simply iterates over the bytes,
in case the QEMU 32-bit test shows significant overhead for the
word-oriented function.
> I'd also guess that pretty much all the calls in-kernel are short.
I'm 99% sure they are! However, I believe that if the word-oriented
solution doesn't introduce performance overhead for shorter strings but
works much faster for longer strings, it is still worth implementing! :)
> You might try counting as: histogram[ilog2(strlen_result)]++
> and seeing what it shows for some workload.
> I bet you (a beer if I see you!) that you won't see many over 1k.
Sounds like a fun experiment, and I accept the bet! Beer is more than
doable as I'm also located in the UK (Manchester).
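
The histogram David describes could be sketched along these lines. This is a
userspace approximation, not kernel code: counted_strlen(), ilog2_sketch(),
strlen_hist and HIST_BUCKETS are hypothetical names, and ilog2_sketch() is a
plain-C stand-in for the kernel's ilog2(). Since the logarithm of 0 is
undefined, the sketch assumes lengths 0 and 1 share bucket 0.

```c
#include <stddef.h>
#include <string.h>

#define HIST_BUCKETS 32

/* strlen_hist[k] counts results with 2^k <= len < 2^(k+1). */
static unsigned long strlen_hist[HIST_BUCKETS];

/* floor(log2(n)) for n > 0; stand-in for the kernel's ilog2(). */
static unsigned int ilog2_sketch(size_t n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Wrapper that records the length of every measured string. */
static size_t counted_strlen(const char *s)
{
	size_t len = strlen(s);
	/* log2(0) is undefined, so length 0 is counted in bucket 0. */
	unsigned int bucket = len ? ilog2_sketch(len) : 0;

	/* Clamp very long strings into the last bucket. */
	if (bucket >= HIST_BUCKETS)
		bucket = HIST_BUCKETS - 1;
	strlen_hist[bucket]++;
	return len;
}
```

With this bucketing, strings of 1024 bytes and longer ("over 1k") land in
bucket 10 and above, so the bet comes down to how small those counters stay
relative to the lower buckets for a given workload.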
--
Kind regards,
Ivan Orlov