lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <26d08954-2826-4c48-a540-6e0857faf5ec@kylinos.cn>
Date: Fri, 23 Jan 2026 11:12:00 +0800
From: Feng Jiang <jiangfeng@...inos.cn>
To: David Laight <david.laight.linux@...il.com>,
 Andy Shevchenko <andy.shevchenko@...il.com>
Cc: Andy Shevchenko <andriy.shevchenko@...el.com>, pjw@...nel.org,
 palmer@...belt.com, aou@...s.berkeley.edu, alex@...ti.fr,
 akpm@...ux-foundation.org, kees@...nel.org, andy@...nel.org,
 ebiggers@...nel.org, martin.petersen@...cle.com, ardb@...nel.org,
 charlie@...osinc.com, conor.dooley@...rochip.com, ajones@...tanamicro.com,
 linus.walleij@...aro.org, nathan@...nel.org,
 linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-hardening@...r.kernel.org
Subject: Re: [PATCH v3 0/8] riscv: optimize string functions and add kunit
 tests

On 2026/1/21 18:57, David Laight wrote:
> On Wed, 21 Jan 2026 09:01:29 +0200
> Andy Shevchenko <andy.shevchenko@...il.com> wrote:
> 
> ...
>> I understand that. My point is if we move the generic implementation
>> to use word-at-a-time technique the difference should not go 4x,
>> right? Perhaps 1.5x or so. I believe this will be a very useful
>> exercise.
> 
> I posted a version earlier.
> 
> After the initial setup (aligning the base address and loading
> some constants the loop on x86-64 is 7 instructions (should be similar
> for other architectures).
> I think it will execute in 4 clocks.
> You then need to find the byte in the word, easy enough on LE with
> a fast ffs() - but harder otherwise.
> The real problem is the cost for short strings.
> Like memcpy() you need a hint from the source of the 'expected' length
> (as a compile-time constant) to compile-time select the algorithm.
> 
> OTOH:
> 	for (;;) {
> 		if (!ptr[0]) return ptr - start;
> 		ptr += 2;
> 	while (ptr[-1]);
> 	return ptr - start - 1;
> has two 'load+compare+branch' and one add per loop.
> On x86 that might all overlap and give you a two-clock loop
> that checks one byte every clock - faster than 'rep scasb'.
> (You can get a two clock loop, but not a 1 clock loop.)
> I think unrolling further will make little/no difference.
> 
> The break-even for the word-at-a-time version is probably at least 64
> characters.
> 

Thanks for the profound analysis and the detailed suggestions.

I am still exploring some of the finer points of these low-level performance
trade-offs, so your input is very helpful. I will definitely spend some time
studying this further and experiment with the approaches you mentioned.

Thanks again for your help!

-- 
With Best Regards,
Feng Jiang


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ