Message-ID: <a58e97ad-a69e-498d-9382-2be4914569b0@kylinos.cn>
Date: Wed, 14 Jan 2026 15:04:58 +0800
From: Feng Jiang <jiangfeng@...inos.cn>
To: Andy Shevchenko <andriy.shevchenko@...el.com>
Cc: pjw@...nel.org, palmer@...belt.com, aou@...s.berkeley.edu, alex@...ti.fr,
 kees@...nel.org, andy@...nel.org, akpm@...ux-foundation.org,
 ebiggers@...nel.org, martin.petersen@...cle.com, ardb@...nel.org,
 ajones@...tanamicro.com, conor.dooley@...rochip.com,
 samuel.holland@...ive.com, linus.walleij@...aro.org, nathan@...nel.org,
 linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-hardening@...r.kernel.org
Subject: Re: [PATCH v2 08/14] lib/string_kunit: add performance benchmark for
 strlen()

On 2026/1/14 14:14, Feng Jiang wrote:
> On 2026/1/13 16:46, Andy Shevchenko wrote:
>> On Tue, Jan 13, 2026 at 04:27:42PM +0800, Feng Jiang wrote:
>>> Introduce a benchmark to compare the architecture-optimized strlen()
>>> implementation against the generic C version (__generic_strlen).
>>>
>>> The benchmark uses a table-driven approach to evaluate performance
>>> across different string lengths (short, medium, and long). It employs
>>> ktime_get() for timing and get_random_bytes() followed by null-byte
>>> filtering to generate test data that prevents early termination.
>>>
>>> This helps in quantifying the performance gains of architecture-specific
>>> optimizations on various platforms.
>>
>> ...
>>
>>> +static void string_test_strlen_bench(struct kunit *test)
>>> +{
>>> +	char *buf;
>>> +	size_t buf_len, iters;
>>> +	ktime_t start, end;
>>> +	u64 time_arch, time_generic;
>>> +
>>> +	buf_len = get_max_bench_len(bench_cases, ARRAY_SIZE(bench_cases)) + 1;
>>> +
>>> +	buf = kunit_kzalloc(test, buf_len, GFP_KERNEL);
>>> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, buf);
>>> +
>>> +	for (size_t i = 0; i < ARRAY_SIZE(bench_cases); i++) {
>>> +		get_random_nonzero_bytes(buf, bench_cases[i].len);
>>> +		buf[bench_cases[i].len] = '\0';
>>> +
>>> +		iters = bench_cases[i].iterations;
>>> +
>>> +		/* 1. Benchmark the architecture-optimized version */
>>> +		start = ktime_get();
>>> +		for (unsigned int j = 0; j < iters; j++) {
>>> +			OPTIMIZER_HIDE_VAR(buf);
>>> +			(void)strlen(buf);
>>
>> First Q: Are you sure the compiler doesn't replace this with __builtin_strlen() ?
>>
>>> +		}
>>> +		end = ktime_get();
>>> +		time_arch = ktime_to_ns(ktime_sub(end, start));
>>> +
>>> +		/* 2. Benchmark the generic C version */
>>> +		start = ktime_get();
>>> +		for (unsigned int j = 0; j < iters; j++) {
>>> +			OPTIMIZER_HIDE_VAR(buf);
>>> +			(void)__generic_strlen(buf);
>>> +		}
>>
>> Are you sure the warmed up caches do not affect the benchmark? I think you need
>> to flush / make caches dirty or so on each iteration.
>>
>>> +		end = ktime_get();
>>> +		time_generic = ktime_to_ns(ktime_sub(end, start));
>>> +
>>> +		string_bench_report(test, "strlen", &bench_cases[i],
>>> +				time_arch, time_generic);
>>> +	}
>>> +}
>>
>>
> 
> Thank you for the catch. You are absolutely correct—the 2500x figure is heavily
> distorted and does not reflect real-world performance.
> 
> I've found that by using a volatile function pointer to call the implementations
> (instead of direct calls), the results returned to a realistic range. It appears
> the previous benchmark logic allowed the compiler to over-optimize the test loop
> in ways that skewed the data.
> 
> I will refactor the benchmark logic in v3, specifically referencing the crc32
> KUnit implementation (e.g., using warm-up loops and adding preempt_disable()
> to eliminate context-switch interference) to ensure the data is robust and accurate.
> 

Just a quick follow-up: I've also verified that storing the return value in a
volatile variable (as done in crc_benchmark()) is equally effective at keeping
the compiler from optimizing the calls away.

The core change is as follows:

    volatile size_t len;
    ...
    for (unsigned int j = 0; j < iters; j++) {
        OPTIMIZER_HIDE_VAR(buf);
        len = strlen(buf);
    }

Preliminary results with this change look much more reasonable:

    ok 4 string_test_strlen
    # string_test_strlen_bench: strlen performance (short, len: 8, iters: 100000):
    # string_test_strlen_bench:   arch-optimized: 4767500 ns
    # string_test_strlen_bench:   generic C:      5815800 ns
    # string_test_strlen_bench:   speedup:        1.21x
    # string_test_strlen_bench: strlen performance (medium, len: 64, iters: 100000):
    # string_test_strlen_bench:   arch-optimized: 6573600 ns
    # string_test_strlen_bench:   generic C:      16342500 ns
    # string_test_strlen_bench:   speedup:        2.48x
    # string_test_strlen_bench: strlen performance (long, len: 2048, iters: 10000):
    # string_test_strlen_bench:   arch-optimized: 7931000 ns
    # string_test_strlen_bench:   generic C:      35347300 ns
    # string_test_strlen_bench:   speedup:        4.45x
    ok 5 string_test_strlen_bench

I will adopt this pattern in v3, along with cache warm-up and preempt_disable(),
to stay consistent with existing kernel benchmarks and ensure robust measurements.
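For each benchmark case, I'm thinking of something along these lines (untested
sketch; the warm-up count of 16 is just a placeholder, and the same shape would
be repeated for the __generic_strlen() measurement):

    volatile size_t len;
    ...
    /* Warm up caches before timing (placeholder iteration count). */
    for (unsigned int j = 0; j < 16; j++)
        len = strlen(buf);

    preempt_disable();
    start = ktime_get();
    for (unsigned int j = 0; j < iters; j++) {
        OPTIMIZER_HIDE_VAR(buf);
        len = strlen(buf);
    }
    end = ktime_get();
    preempt_enable();
    time_arch = ktime_to_ns(ktime_sub(end, start));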

-- 
With Best Regards,
Feng Jiang

