Message-ID: <6db87d0e-bfb2-4d8e-88a8-a727f96aa2d5@kylinos.cn>
Date: Wed, 14 Jan 2026 16:05:38 +0800
From: Feng Jiang <jiangfeng@...inos.cn>
To: Andy Shevchenko <andriy.shevchenko@...el.com>
Cc: pjw@...nel.org, palmer@...belt.com, aou@...s.berkeley.edu, alex@...ti.fr,
 kees@...nel.org, andy@...nel.org, akpm@...ux-foundation.org,
 ebiggers@...nel.org, martin.petersen@...cle.com, ardb@...nel.org,
 ajones@...tanamicro.com, conor.dooley@...rochip.com,
 samuel.holland@...ive.com, linus.walleij@...aro.org, nathan@...nel.org,
 linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-hardening@...r.kernel.org
Subject: Re: [PATCH v2 08/14] lib/string_kunit: add performance benchmark for
 strlen()

On 2026/1/14 15:21, Andy Shevchenko wrote:
> On Wed, Jan 14, 2026 at 03:04:58PM +0800, Feng Jiang wrote:
>> On 2026/1/14 14:14, Feng Jiang wrote:
>>> On 2026/1/13 16:46, Andy Shevchenko wrote:
> 
> ...
> 
>>> Thank you for the catch. You are absolutely correct—the 2500x figure is heavily
>>> distorted and does not reflect real-world performance.
>>>
>>> I've found that by using a volatile function pointer to call the implementations
>>> (instead of direct calls), the results returned to a realistic range. It appears
>>> the previous benchmark logic allowed the compiler to over-optimize the test loop
>>> in ways that skewed the data.
>>>
>>> I will refactor the benchmark logic in v3, specifically referencing the crc32
>>> KUnit implementation (e.g., using warm-up loops and adding preempt_disable()
>>> to eliminate context-switch interference) to ensure the data is robust and accurate.
>>>
>>
>> Just a quick follow-up: I've also verified that using a volatile variable to store
>> the return value (as seen in crc_benchmark()) is equally effective at preventing
>> the optimization.
>>
>> The core change is as follows:
>>
>>     volatile size_t len;
>>     ...
>>     for (unsigned int j = 0; j < iters; j++) {
>>         OPTIMIZER_HIDE_VAR(buf);
>>         len = strlen(buf);
> 
> But please, check for sure that this is the Linux kernel generic implementation
> (before) and not __builtin_strlen() from GCC. (OTOH, it would be nice to benchmark
> that one as well, although I think that __builtin_strlen() may in general be a
> slightly better choice than the Linux kernel generic implementation.) I.o.w. be
> sure *what* you test.
> 

Thanks for the reminder. I actually verified this with objdump and gdb before
submitting the patch: the calls are indeed hitting the intended arch-specific
strlen symbols, not the compiler's __builtin_strlen(). I forgot to mention this
detail in my previous email.
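
For reference, the check was roughly along the lines of the command below (the
object path and toolchain prefix are only illustrative and depend on the build
tree and config):

    $ riscv64-linux-gnu-objdump -dr lib/string_kunit.o | grep -B1 strlen

Seeing an actual call with a relocation against the strlen symbol, rather than
the length being folded away or replaced by an inline builtin expansion, is what
confirms that the out-of-line implementation is what the loop measures.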

I also just performed an additional test by explicitly calling the exported
arch-specific __pi_strlen() symbol, and the results remained consistent.

Results with riscv __pi_strlen():

    ok 4 string_test_strlen
    # string_test_strlen_bench: strlen performance (short, len: 8, iters: 100000):
    # string_test_strlen_bench:   arch-optimized: 4650500 ns
    # string_test_strlen_bench:   generic C:      5776000 ns
    # string_test_strlen_bench:   speedup:        1.24x
    # string_test_strlen_bench: strlen performance (medium, len: 64, iters: 100000):
    # string_test_strlen_bench:   arch-optimized: 6895000 ns
    # string_test_strlen_bench:   generic C:      16343400 ns
    # string_test_strlen_bench:   speedup:        2.37x
    # string_test_strlen_bench: strlen performance (long, len: 2048, iters: 10000):
    # string_test_strlen_bench:   arch-optimized: 8052800 ns
    # string_test_strlen_bench:   generic C:      35290700 ns
    # string_test_strlen_bench:   speedup:        4.38x
    ok 5 string_test_strlen_bench
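
For reference, the explicit-call variant amounts to roughly the following sketch
(the local declaration is only there for the test and simply mirrors the usual
strlen() prototype; the rest of the timed loop is unchanged):

    /* riscv assembly implementation, normally reached via the strlen alias */
    size_t __pi_strlen(const char *s);
    ...
        len = __pi_strlen(buf);    /* instead of len = strlen(buf) */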

>>     }
> 
> Or using WRITE_ONCE() :-) But that one will probably be confusing as it usually
> should be paired with READ_ONCE() somewhere else in the code. So, I agree with
> the crc_benchmark() approach taken.
> 

Thanks for the guidance. I'll stick with the crc_benchmark() pattern to avoid any
potential confusion regarding concurrency that WRITE_ONCE() might imply.

I'm still learning the most idiomatic practices in the kernel, so I appreciate the tip.
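
For concreteness, the v3 measurement will look roughly like the sketch below,
modeled on crc_benchmark(); the helper name strlen_bench_one(), the warm-up
count, and the report format are placeholders, not the final code:

    #include <kunit/test.h>
    #include <linux/compiler.h>
    #include <linux/ktime.h>
    #include <linux/preempt.h>
    #include <linux/string.h>

    static void strlen_bench_one(struct kunit *test, const char *buf,
                                 unsigned int iters)
    {
        volatile size_t len;    /* keep the result live so the loop is not elided */
        ktime_t start, end;
        unsigned int j;

        /* Warm up caches and branch predictors before timing. */
        for (j = 0; j < 16; j++)
            len = strlen(buf);

        preempt_disable();    /* keep context switches out of the timed window */
        start = ktime_get();
        for (j = 0; j < iters; j++) {
            OPTIMIZER_HIDE_VAR(buf);
            len = strlen(buf);
        }
        end = ktime_get();
        preempt_enable();

        kunit_info(test, "strlen: %lld ns for %u iterations",
                   ktime_to_ns(ktime_sub(end, start)), iters);
    }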

>> Preliminary results with this change look much more reasonable:
>>
>>     ok 4 string_test_strlen
>>     # string_test_strlen_bench: strlen performance (short, len: 8, iters: 100000):
>>     # string_test_strlen_bench:   arch-optimized: 4767500 ns
>>     # string_test_strlen_bench:   generic C:      5815800 ns
>>     # string_test_strlen_bench:   speedup:        1.21x
>>     # string_test_strlen_bench: strlen performance (medium, len: 64, iters: 100000):
>>     # string_test_strlen_bench:   arch-optimized: 6573600 ns
>>     # string_test_strlen_bench:   generic C:      16342500 ns
>>     # string_test_strlen_bench:   speedup:        2.48x
>>     # string_test_strlen_bench: strlen performance (long, len: 2048, iters: 10000):
>>     # string_test_strlen_bench:   arch-optimized: 7931000 ns
>>     # string_test_strlen_bench:   generic C:      35347300 ns
>>     # string_test_strlen_bench:   speedup:        4.45x
>>     ok 5 string_test_strlen_bench
>>
>> I will adopt this pattern in v3, along with cache warm-up and preempt_disable(),
>> to stay consistent with existing kernel benchmarks and ensure robust measurements.
> 

-- 
With Best Regards,
Feng Jiang

