Message-ID: <CAHp75Ve_pyVm430FL=aEAw1Cnf-92T3Y23qh7pEOaMYMp9iyvw@mail.gmail.com>
Date: Wed, 21 Jan 2026 09:01:29 +0200
From: Andy Shevchenko <andy.shevchenko@...il.com>
To: Feng Jiang <jiangfeng@...inos.cn>
Cc: Andy Shevchenko <andriy.shevchenko@...el.com>, pjw@...nel.org, palmer@...belt.com,
aou@...s.berkeley.edu, alex@...ti.fr, akpm@...ux-foundation.org,
kees@...nel.org, andy@...nel.org, ebiggers@...nel.org,
martin.petersen@...cle.com, ardb@...nel.org, charlie@...osinc.com,
conor.dooley@...rochip.com, ajones@...tanamicro.com, linus.walleij@...aro.org,
nathan@...nel.org, linux-riscv@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-hardening@...r.kernel.org
Subject: Re: [PATCH v3 0/8] riscv: optimize string functions and add kunit tests
On Wed, Jan 21, 2026 at 8:44 AM Feng Jiang <jiangfeng@...inos.cn> wrote:
> On 2026/1/20 15:36, Andy Shevchenko wrote:
> > On Tue, Jan 20, 2026 at 02:58:44PM +0800, Feng Jiang wrote:
...
> >> Performance Summary (Improvement %):
> >> ---------------------------------------------------------------
> >> Function | 16 B (Short) | 512 B (Mid) | 4096 B (Long)
> >> ---------------------------------------------------------------
> >> strnlen | +64.0% | +346.2% | +410.7%
> >
> > This is still suspicious.
>
> Regarding the +410% gain, it becomes clearer when looking at the inner loop of the
> Zbb implementation (https://docs.riscv.org/reference/isa/unpriv/b-st-ext.html#zbb).
> For a 64-bit system, the core loop uses only 5 instructions to process 8 bytes.
> Note that t3 is pre-loaded with -1 (0xFFFFFFFFFFFFFFFF):
>
> 1:
> REG_L t1, SZREG(t0) /* Load 8 bytes */
> addi t0, t0, SZREG /* Advance pointer */
> orc.b t1, t1 /* Bitwise OR-Combine: 0x00 becomes 0x00, others 0xFF */
> bgeu t0, t4, 4f /* Boundary check (max_len) */
> beq t1, t3, 1b /* If t1 == 0xFFFFFFFFFFFFFFFF (no NUL), loop */
>
> In contrast, the generic C implementation performs byte-by-byte comparisons, which involves
> significantly more loads and branches for every single byte processed. The Zbb approach is
> much leaner: the orc.b instruction collapses the NUL-check for an entire 8-byte word into a
> single step. By shifting from a byte-oriented loop to this hardware-accelerated word-at-a-time
> logic, we drastically reduce the instruction count and branch overhead, which explains the
> massive jump in TCG throughput for long strings.
>
> Beyond the main loop, the Zbb implementation also utilizes ctz (Count Trailing Zeros)
> to handle the tail and alignment. Once orc.b identifies a NUL byte within a register,
> we can precisely locate its position in just two instructions:
>
> not t1, t1 /* Flip bits: NUL byte (0x00) becomes 0xFF */
> ctz t1, t1 /* Count bits before the first NUL byte */
> srli t1, t1, 3 /* Divide by 8 to get byte offset */
>
> In a generic C implementation, calculating this byte offset typically requires a series
> of shifts, masks, or a sub-loop, which adds significant overhead. By combining orc.b and
> ctz, we eliminate all branching and lookup tables for the tail-end calculation, further
> contributing to the performance gains observed in the benchmarks.
>
> >> strchr | +4.0% | +6.4% | +1.5%
> >> strrchr | +6.6% | +2.8% | +0.0%
>
> As for strchr() and strrchr(), the improvements are relatively modest because the current
> versions in this series are implemented as simple byte-by-byte assembly. They gain
> performance primarily by reducing function-call overhead and eliminating stack-frame
> management compared to the generic C version.
>
> Unlike strnlen(), they do not yet utilize Zbb extensions. I plan to introduce Zbb-optimized
> versions for these functions in a future patch set, which I expect will bring performance
> gains similar to what we now see with strnlen().
Thanks for the details on the native assembly implementation.
> >> word-at-a-time logic, showing significant gains as the string length
> >> increases.
> >
> > Hmm... Have you tried to optimise the generic implementation to use
> > word-at-a-time logic and compare?
>
> Regarding the generic implementation, even if we were to optimize the C code
> to use word-at-a-time logic (the has_zero() style bit-manipulation), it still
> wouldn't match the Zbb version's efficiency.
>
> The traditional C-based word-level detection requires a sequence of arithmetic
> operations to identify NUL bytes. In contrast, the RISC-V orc.b instruction
> collapses this entire check into a single instruction. I've focused on this
> architectural approach to fully leverage these specific Zbb features, which
> provides a level of instruction density that generic C math cannot achieve.
I understand that. My point is that if we move the generic implementation
to the word-at-a-time technique, the difference should not be 4x,
right? Perhaps 1.5x or so. I believe this would be a very useful
exercise.
--
With Best Regards,
Andy Shevchenko