lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <691a0183-6f41-4956-82da-abb69a449919@kylinos.cn>
Date: Wed, 21 Jan 2026 14:44:21 +0800
From: Feng Jiang <jiangfeng@...inos.cn>
To: Andy Shevchenko <andriy.shevchenko@...el.com>
Cc: pjw@...nel.org, palmer@...belt.com, aou@...s.berkeley.edu, alex@...ti.fr,
 akpm@...ux-foundation.org, kees@...nel.org, andy@...nel.org,
 ebiggers@...nel.org, martin.petersen@...cle.com, ardb@...nel.org,
 charlie@...osinc.com, conor.dooley@...rochip.com, ajones@...tanamicro.com,
 linus.walleij@...aro.org, nathan@...nel.org,
 linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-hardening@...r.kernel.org
Subject: Re: [PATCH v3 0/8] riscv: optimize string functions and add kunit
 tests

On 2026/1/20 15:36, Andy Shevchenko wrote:
> On Tue, Jan 20, 2026 at 02:58:44PM +0800, Feng Jiang wrote:
>> This series provides optimized implementations of strnlen(), strchr(),
>> and strrchr() for the RISC-V architecture. The strnlen implementation
>> is derived from the existing optimized strlen. For strchr and strrchr,
> 
> strchr() and strrchr()
>>> the current versions use simple byte-by-byte assembly logic, which
>> will serve as a baseline for future Zbb-based optimizations.
>>
>> The patch series is organized into three parts:
>> 1. Correctness Testing: The first three patches add KUnit test cases
>>    for strlen, strnlen, and strrchr to ensure the baseline and optimized
> 
> strlen(), strnlen(), and strrchr()

Will fix these to include parentheses consistently in the next version.

>> Performance Summary (Improvement %):
>> ---------------------------------------------------------------
>> Function  |  16 B (Short) |  512 B (Mid) |  4096 B (Long)
>> ---------------------------------------------------------------
>> strnlen   |    +64.0%     |   +346.2%    |    +410.7%
> 
> This is still suspicious.
> 

Regarding the +410% gain, it becomes clearer when looking at the inner loop of the
Zbb implementation (https://docs.riscv.org/reference/isa/unpriv/b-st-ext.html#zbb).
For a 64-bit system, the core loop uses only 5 instructions to process 8 bytes.
Note that t3 is pre-loaded with -1 (0xFFFFFFFFFFFFFFFF):

1:
    REG_L   t1, SZREG(t0)      /* Load 8 bytes */
    addi    t0, t0, SZREG      /* Advance pointer */
    orc.b   t1, t1             /* Bitwise OR-Combine: 0x00 becomes 0x00, others 0xFF */
    bgeu    t0, t4, 4f         /* Boundary check (max_len) */
    beq     t1, t3, 1b         /* If t1 == 0xFFFFFFFFFFFFFFFF (no NUL), loop */

In contrast, the generic C implementation performs byte-by-byte comparisons, which involves
significantly more loads and branches for every single byte processed. The Zbb approach is
much leaner: the orc.b instruction collapses the NUL-check for an entire 8-byte word into a
single step. By shifting from a byte-oriented loop to this hardware-accelerated word-at-a-time
logic, we drastically reduce the instruction count and branch overhead, which explains the
massive jump in TCG throughput for long strings.

Beyond the main loop, the Zbb implementation also utilizes ctz (Count Trailing Zeros)
to handle the tail and alignment. Once orc.b identifies a NUL byte within a register,
we can precisely locate its position in just two instructions:

    not     t1, t1             /* Flip bits: NUL byte (0x00) becomes 0xFF */
    ctz     t1, t1             /* Count bits before the first NUL byte */
    srli    t1, t1, 3          /* Divide by 8 to get byte offset */

In a generic C implementation, calculating this byte offset typically requires a series
of shifts, masks, or a sub-loop, which adds significant overhead. By combining orc.b and
ctz, we eliminate all branching and lookup tables for the tail-end calculation, further
contributing to the performance gains observed in the benchmarks.

>> strchr    |    +4.0%      |   +6.4%      |    +1.5%
>> strrchr   |    +6.6%      |   +2.8%      |    +0.0%

As for strchr() and strrchr(), the relatively modest improvements are because the current
versions in this series are implemented using simple byte-by-byte assembly. These primarily
gain performance by reducing function call overhead and eliminating stack frame management
compared to the generic C version.

Unlike strnlen(), they do not yet utilize Zbb extensions. I plan to introduce Zbb-optimized
versions for these functions in a future patch set, which I expect will bring performance
gains similar to what we now see with strnlen().

>> word-at-a-time logic, showing significant gains as the string length
>> increases.
> 
> Hmm... Have you tried to optimise the generic implementation to use
> word-at-a-time logic and compare?
> 

Regarding the generic implementation, even if we were to optimize the C code
to use word-at-a-time logic (the has_zero() style bit-manipulation), it still
wouldn't match the Zbb version's efficiency.

The traditional C-based word-level detection requires a sequence of arithmetic
operations to identify NUL bytes. In contrast, the RISC-V orc.b instruction
collapses this entire check into a single hardware cycle. I've focused on this
architectural approach to fully leverage these specific Zbb features, which
provides a level of instruction density that generic C math cannot achieve.

>> For strchr and strrchr, the handwritten assembly reduces
> 
> strchr() and strrchr()
> 
>> fixed overhead by eliminating stack frame management. The gain is most
>> prominent on short strings (1-16B) where function call overhead dominates,
>> while the performance converges with the C implementation for longer
>> strings in the TCG environment.
> 

Thanks for your detailed review!

-- 
With Best Regards,
Feng Jiang


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ