Message-ID: <CAHp75Ve_pyVm430FL=aEAw1Cnf-92T3Y23qh7pEOaMYMp9iyvw@mail.gmail.com>
Date: Wed, 21 Jan 2026 09:01:29 +0200
From: Andy Shevchenko <andy.shevchenko@...il.com>
To: Feng Jiang <jiangfeng@...inos.cn>
Cc: Andy Shevchenko <andriy.shevchenko@...el.com>, pjw@...nel.org, palmer@...belt.com,
aou@...s.berkeley.edu, alex@...ti.fr, akpm@...ux-foundation.org,
kees@...nel.org, andy@...nel.org, ebiggers@...nel.org,
martin.petersen@...cle.com, ardb@...nel.org, charlie@...osinc.com,
conor.dooley@...rochip.com, ajones@...tanamicro.com, linus.walleij@...aro.org,
nathan@...nel.org, linux-riscv@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-hardening@...r.kernel.org
Subject: Re: [PATCH v3 0/8] riscv: optimize string functions and add kunit tests
On Wed, Jan 21, 2026 at 8:44 AM Feng Jiang <jiangfeng@...inos.cn> wrote:
> On 2026/1/20 15:36, Andy Shevchenko wrote:
> > On Tue, Jan 20, 2026 at 02:58:44PM +0800, Feng Jiang wrote:
...
> >> Performance Summary (Improvement %):
> >> ---------------------------------------------------------------
> >> Function | 16 B (Short) | 512 B (Mid) | 4096 B (Long)
> >> ---------------------------------------------------------------
> >> strnlen | +64.0% | +346.2% | +410.7%
> >
> > This is still suspicious.
>
> Regarding the +410% gain, it becomes clearer when looking at the inner loop of the
> Zbb implementation (https://docs.riscv.org/reference/isa/unpriv/b-st-ext.html#zbb).
> For a 64-bit system, the core loop uses only 5 instructions to process 8 bytes.
> Note that t3 is pre-loaded with -1 (0xFFFFFFFFFFFFFFFF):
>
> 1:
> REG_L t1, SZREG(t0) /* Load 8 bytes */
> addi t0, t0, SZREG /* Advance pointer */
> orc.b t1, t1 /* Bitwise OR-Combine: 0x00 becomes 0x00, others 0xFF */
> bgeu t0, t4, 4f /* Boundary check (max_len) */
> beq t1, t3, 1b /* If t1 == 0xFFFFFFFFFFFFFFFF (no NUL), loop */
>
> In contrast, the generic C implementation performs byte-by-byte comparisons, which involves
> significantly more loads and branches for every single byte processed. The Zbb approach is
> much leaner: the orc.b instruction collapses the NUL-check for an entire 8-byte word into a
> single step. By shifting from a byte-oriented loop to this hardware-accelerated word-at-a-time
> logic, we drastically reduce the instruction count and branch overhead, which explains the
> massive jump in TCG throughput for long strings.
>
> Beyond the main loop, the Zbb implementation also utilizes ctz (Count Trailing Zeros)
> to handle the tail and alignment. Once orc.b identifies a NUL byte within a register,
> we can precisely locate its position in just two instructions:
>
> not t1, t1 /* Flip bits: NUL byte (0x00) becomes 0xFF */
> ctz t1, t1 /* Count bits before the first NUL byte */
> srli t1, t1, 3 /* Divide by 8 to get byte offset */
>
> In a generic C implementation, calculating this byte offset typically requires a series
> of shifts, masks, or a sub-loop, which adds significant overhead. By combining orc.b and
> ctz, we eliminate all branching and lookup tables for the tail-end calculation, further
> contributing to the performance gains observed in the benchmarks.
>
> >> strchr | +4.0% | +6.4% | +1.5%
> >> strrchr | +6.6% | +2.8% | +0.0%
>
> As for strchr() and strrchr(), the improvements are relatively modest because the current
> versions in this series are implemented as simple byte-by-byte assembly. They gain
> performance primarily by reducing function-call overhead and eliminating stack-frame
> management compared to the generic C version.
>
> Unlike strnlen(), they do not yet utilize Zbb extensions. I plan to introduce Zbb-optimized
> versions for these functions in a future patch set, which I expect will bring performance
> gains similar to what we now see with strnlen().
Thanks for the details on the native assembly implementation.
> >> word-at-a-time logic, showing significant gains as the string length
> >> increases.
> >
> > Hmm... Have you tried to optimise the generic implementation to use
> > word-at-a-time logic and compare?
>
> Regarding the generic implementation, even if we were to optimize the C code
> to use word-at-a-time logic (the has_zero() style bit-manipulation), it still
> wouldn't match the Zbb version's efficiency.
>
> The traditional C-based word-level detection requires a sequence of arithmetic
> operations to identify NUL bytes. In contrast, the RISC-V orc.b instruction
> collapses this entire check into a single instruction. I've focused on this
> architectural approach to fully leverage these specific Zbb features, which
> provides a level of instruction density that generic C math cannot achieve.
I understand that. My point is that if we move the generic implementation
to the word-at-a-time technique, the difference should not be 4x,
right? Perhaps 1.5x or so. I believe this would be a very useful
exercise.
--
With Best Regards,
Andy Shevchenko