Message-ID: <04b07f3a-ef18-4662-8529-203348ad3b29@kylinos.cn>
Date: Thu, 29 Jan 2026 16:34:33 +0800
From: Feng Jiang <jiangfeng@...inos.cn>
To: David Laight <david.laight.linux@...il.com>,
 Paul Walmsley <pjw@...nel.org>
Cc: palmer@...belt.com, aou@...s.berkeley.edu, alex@...ti.fr,
 samuel.holland@...ive.com, charlie@...osinc.com, conor.dooley@...rochip.com,
 linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] riscv: lib: optimize strlen loop efficiency

On 2026/1/29 02:59, David Laight wrote:
> On Thu, 15 Jan 2026 18:46:19 +0000
> David Laight <david.laight.linux@...il.com> wrote:
> 
> ... 
>> While I suspect the per-byte cost is 'two bytes/clock' on x86-64
>> the fixed cost may move the break-even point above the length of the
>> average strlen() in the kernel.
>> Of course, x86 probably falls back to 'rep scasb' at (maybe)
>> (40 + 2n) clocks for 'n' bytes.
>> A carefully written slightly unrolled asm loop might manage one
>> byte per clock!
>> I could spend weeks benchmarking different versions.
> 
> I've spent a quick half-hour...
> 
> On my zen-5 in userspace:
> 
> glibc's strlen() is showing the same fixed cost (50 clocks including overhead)
> for sizes below (about) 100 bytes, for big buffers add 1 clock for ~50 bytes.
> It must be using some simd instructions.
> 
> A simple:
> 	len = 0; while (s[len]) len++; return len;
> loop is about 1 byte/clock, overhead ~25 clocks (probably mostly the one 'rdpmc'
> instruction).
> (Needs a barrier() to stop gcc converting it to a libc call.)
> 
> Unrolling the loop once:
> 	for (len = 0; s[len]; len += 2)
> 		if (!s[len + 1]) return len + 1;
> 	return len;
> actually runs twice as fast - so 2 bytes/clock.
> 
> Unrolling 4 times doesn't help, suddenly goes somewhat slower somewhere
> between 128 and 256 bytes (to 1.5 bytes/clock).
> 
> The C 'longs' loop has an overhead of ~45 clocks and does 6 bytes/clock.
> So that is better for buffers longer than 64 bytes.
> 
> The 'elephant in the room' is 'repne scasb'.
> The fixed cost is some 150 clocks and the per-byte cost is 3 clocks/byte.
> 
> I don't think any of the Intel CPUs I have will do a 'one clock loop'.
> I certainly failed to get one in the past when there was a data-dependency
> between the iterations.
> 
> But I don't have anything modern (newest is an i7-7xxx) and I don't have
> any old amd ones.
> I need to get a zen-1 (or 1a) and one of the Intel systems that should be
> cheap because they won't run win-11.

Thank you very much for sharing these detailed test results and your in-depth
analysis. It is truly helpful and inspiring to see how different loop strategies
perform on real hardware.
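
For reference, the two byte-loop variants quoted above can be written out as
complete functions. This is only a sketch: the names strlen_byte() and
strlen_unroll2() are placeholders, and barrier() is assumed to be a GCC/Clang
style empty asm with a "memory" clobber, used here to stop the compiler from
turning the loop back into a library strlen() call.

```c
#include <stddef.h>

/* Assumed GCC/Clang syntax: an empty asm statement with a memory clobber. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* One byte per iteration. */
size_t strlen_byte(const char *s)
{
	size_t len = 0;

	while (s[len]) {
		barrier();
		len++;
	}
	return len;
}

/* Two bytes per iteration, i.e. the loop "unrolled once". */
size_t strlen_unroll2(const char *s)
{
	size_t len;

	for (len = 0; s[len]; len += 2) {
		barrier();
		/* s[len + 1] is only read after s[len] was seen to be non-zero. */
		if (!s[len + 1])
			return len + 1;
	}
	return len;
}
```

Neither loop reads past the terminating NUL, so both are safe on a string that
ends right at a page boundary.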

My current priority is the RISC-V patch series. Once this is done, I'd love to
follow up and explore potential improvements for the generic C implementation.
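
As a starting point for that, here is a rough sketch of a word-at-a-time loop,
assuming the 'longs' loop mentioned above is something like the classic
"has a zero byte" trick. It is not the code that was benchmarked and not the
kernel's implementation; strlen_words(), ONES and HIGHS are placeholder names.
Note that the aligned word load can read a few bytes past the terminator, which
is outside what ISO C guarantees even though it stays within the same aligned
word in practice.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  ((size_t)-1 / 0xff)	/* 0x0101...01 */
#define HIGHS (ONES * 0x80)		/* 0x8080...80 */

size_t strlen_words(const char *s)
{
	const char *p = s;
	size_t w;

	/* Byte loop until p is word aligned, so a word load never straddles a page. */
	while ((uintptr_t)p % sizeof(w)) {
		if (!*p)
			return (size_t)(p - s);
		p++;
	}

	/* (w - ONES) & ~w & HIGHS is non-zero iff some byte of w is zero. */
	for (;;) {
		memcpy(&w, p, sizeof(w));
		if ((w - ONES) & ~w & HIGHS)
			break;
		p += sizeof(w);
	}

	/* Locate the terminator inside the final word. */
	while (*p)
		p++;
	return (size_t)(p - s);
}
```

Plugging in the figures quoted above, and assuming the unrolled byte loop keeps
the ~25 clock overhead of the simple loop, the two cost models are roughly
25 + n/2 and 45 + n/6 clocks. They cross at n = 60, which is consistent with
the word loop only winning for buffers longer than about 64 bytes.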

While I don't have many x86 machines at hand either, I do have access to some
ARM64 and LoongArch hardware. I think I can also perform tests and observations
on these platforms later.

Thanks again for the great discussion!

-- 
With Best Regards,
Feng Jiang

