Message-ID: <b70068c3-3b92-4c90-a439-d3f8582953e8@kylinos.cn>
Date: Mon, 26 Jan 2026 10:52:54 +0800
From: Feng Jiang <jiangfeng@...inos.cn>
To: David Laight <david.laight.linux@...il.com>,
 Paul Walmsley <pjw@...nel.org>
Cc: palmer@...belt.com, aou@...s.berkeley.edu, alex@...ti.fr,
 samuel.holland@...ive.com, charlie@...osinc.com, conor.dooley@...rochip.com,
 linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] riscv: lib: optimize strlen loop efficiency

On 2026/1/16 02:46, David Laight wrote:
> On Thu, 15 Jan 2026 11:19:47 +0000
> David Laight <david.laight.linux@...il.com> wrote:
> 
>> For 64bit you can do a lot better (in C) by loading 64bit words and doing
>> the correct 'shift and mask' sequence to detect a zero byte.
>> It usually isn't worth it for 32bit.
>>
>> It does need to handle a misaligned base - eg by masking the bits off
>> the base pointer and or'ing in non-zero values to the value read from
>> the base pointer.
>>
>> 	David
> 
> The version below seems to work https://www.godbolt.org/z/sME3Ts6vW
> It actually looks ok for x86-32: the loop is 8 instructions plus the branch,
> but the 'register dependency chain' is only 4 instructions.
> So maybe better than byte compares for moderate to long strings.
> (Especially if the cpu starts speculatively executing the next loop
> iteration.)
> 
> The OPTIMIZER_HIDE_VAR() helps a lot on (eg) MIPS-64 and a bit elsewhere
> since most 64bit cpus can't load 64bit immediates.
> 
> I can't get gcc and clang to reliably have a loop with a conditional
> jump at the bottom, especially with an unconditional jump into the
> loop (to remove the '| mask' from the loop body).
> 
> Also KASAN (or one of its friends) won't like the code reading entire
> words that hold the string.
> 
> And it does need ffs/clz instructions - or a different loop bottom.
> (For BE one with clzl() returning 0 will work.) 
> 
> While I suspect the per-byte cost is 'two bytes/clock' on x86-64
> the fixed cost may move the break-even point above the length of the
> average strlen() in the kernel.
> Of course, x86 probably falls back to 'rep scasb' at (maybe)
> (40 + 2n) clocks for 'n' bytes.
> A carefully written slightly unrolled asm loop might manage one
> byte per clock!
> I could spend weeks benchmarking different versions.
> 
> 	David
> 
> #define OPTIMIZER_HIDE_VAR(var)                                         \
>         __asm__ ("" : "=r" (var) : "0" (var))
> 
> /* Set BE to test big-endian on little-endian.
>  * For real BE either do a byteswapping read or use the BE code. */
> #ifdef BE
> #define SWP(x) __builtin_bswap64(x)
> #define SHIFT <<
> #else
> #define SWP(x) (x)
> #define SHIFT >>
> #endif
> 
> unsigned long my_strlen(const char *s)
> {
>         unsigned int off = (unsigned long)s % sizeof (long);
>         const unsigned long *p = (void *)(s - off);
>         unsigned long val;
>         unsigned long mask;
>         unsigned long ones = 0x01010101;
>         /* Force the compiler to generate the related constants sanely. */
>         OPTIMIZER_HIDE_VAR(ones);
>         ones |= ones << 16 << 16;
> 
>         mask = ((~0ul SHIFT 8) SHIFT 8 * (sizeof (long) - 1 - off));
>         do {
>                 val = SWP(*p++) | mask;
>                 mask = (val - ones) & ~val & ones << 7;
>         } while (!mask);
> 
> #ifdef BE
>         off = __builtin_clzl(mask);
>         /* Correct for "...\x01" */
>         val <<= off;
>         for (off /= 8; val > (~0ul >> 8); off++)
>                 val <<= 8;
> #else
>         off = (__builtin_ffsl(mask) - 1)/8;
> #endif
> 
>         return (const char *)(p - 1) + off - s;
> }
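
If I understand the point about the loop shape correctly (conditional
branch at the bottom, with an unconditional jump into the loop so the
'| mask' only happens for the first word), then a little-endian-only
rewrite of your function along those lines might look like the sketch
below. This is only my reading of the idea and is untested; the function
name and the file are mine, not a proposed patch:

#include <stdio.h>
#include <stddef.h>

#define OPTIMIZER_HIDE_VAR(var) __asm__ ("" : "=r" (var) : "0" (var))

/* Little-endian only; reads whole aligned words, so it can read a few
 * bytes past the terminating NUL (by design, as in the version above). */
static size_t my_strlen_peeled(const char *s)
{
        unsigned int off = (unsigned long)s % sizeof(long);
        const unsigned long *p = (const void *)(s - off);
        unsigned long val, mask;
        unsigned long ones = 0x01010101;

        /* Build 0x0101...01 without a 64bit immediate. */
        OPTIMIZER_HIDE_VAR(ones);
        ones |= ones << 16 << 16;

        /* Force the bytes before the start of the string non-zero. */
        mask = (~0ul >> 8) >> 8 * (sizeof(long) - 1 - off);

        val = *p++ | mask;
        goto check;
        do {
                val = *p++;             /* no '| mask' in the steady state */
check:
                mask = (val - ones) & ~val & ones << 7;
        } while (!mask);

        off = (__builtin_ffsl(mask) - 1) / 8;
        return (const char *)(p - 1) + off - s;
}

int main(void)
{
        printf("%zu\n", my_strlen_peeled("hello, world"));     /* 12 */
        return 0;
}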

Thank you very much for your detailed explanation and the insightful examples!

Your explanation of the instruction cycles and performance trade-offs has
been very enlightening. As I mentioned in my other patch series, I really
appreciate the technical depth you've shared. I will spend some time
studying this area further so that I can better understand these
trade-offs and hopefully contribute more meaningful improvements in the
future.
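
To convince myself how the loop bottom recovers the byte index, I also
played with the detection step on a single word. On little-endian the
lowest set bit of the mask always belongs to the first NUL byte - a 0x01
byte sitting just after a NUL can set a spurious higher bit, but ffs/ctz
never sees it. (I assume the "...\x01" correction in your BE path handles
the mirrored case, where the spurious bit would land before the real NUL.)
A small standalone test, entirely my own code and assuming a little-endian
host:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The detection step from the loop above: sets the 0x80 bit of the first
 * zero byte of v (and possibly of some bytes after it). */
static uint64_t zero_byte_mask(uint64_t v)
{
        const uint64_t ones  = 0x0101010101010101ULL;
        const uint64_t highs = 0x8080808080808080ULL;

        return (v - ones) & ~v & highs;
}

int main(void)
{
        /* 'a' 'b' 'c' NUL 0x01 'f' 'g' NUL - first NUL is byte 3. */
        const unsigned char bytes[8] = { 'a', 'b', 'c', 0, 1, 'f', 'g', 0 };
        uint64_t w, m;

        memcpy(&w, bytes, sizeof(w));
        m = zero_byte_mask(w);

        printf("mask = %#llx\n", (unsigned long long)m);
        printf("first NUL at byte %d\n", __builtin_ctzll(m) / 8);
        /* Prints mask = 0x8000008080000000 and "first NUL at byte 3":
         * byte 4 (the 0x01) picks up a spurious bit, byte 3 holds the
         * lowest one, and 31 / 8 == 3. */
        return 0;
}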

Thanks again for your time and for sharing your expertise!
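
P.S. On the KASAN point: if I read the existing helpers correctly,
<asm/word-at-a-time.h> already wraps this pattern (has_zero(),
prep_zero_mask(), create_zero_mask(), find_zero()), and
read_word_at_a_time() exists so that the whole-word loads do not upset
KASAN - strscpy() uses that combination. Below is a rough, untested
sketch of how an aligned-only inner loop might look with those helpers;
I may well be misreading the API, and the misalignment fixup you
describe is left out:

#include <linux/types.h>
#include <linux/compiler.h>
#include <asm/word-at-a-time.h>

/* Untested sketch only: assumes 's' is already long-aligned. */
static size_t strlen_word_at_a_time(const char *s)
{
        const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
        const char *p = s;
        unsigned long c, data;

        for (;; p += sizeof(unsigned long)) {
                c = read_word_at_a_time(p);     /* KASAN-tolerant word load */
                if (has_zero(c, &data, &constants)) {
                        data = prep_zero_mask(c, data, &constants);
                        data = create_zero_mask(data);
                        return p - s + find_zero(data);
                }
        }
}

If that is right, find_zero() would also hide the endian-specific ffs/clz
choice.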

-- 
With Best Regards,
Feng Jiang

