Message-ID: <aE2obOfz6KHXRjKP@yury>
Date: Sat, 14 Jun 2025 12:50:52 -0400
From: Yury Norov <yury.norov@...il.com>
To: I Hsin Cheng <richard120310@...il.com>
Cc: jstultz@...gle.com, tglx@...utronix.de, sboyd@...nel.org,
linux-kernel@...r.kernel.org, jserv@...s.ncku.edu.tw,
skhan@...uxfoundation.org, linux-kernel-mentees@...ts.linux.dev
Subject: Re: [PATCH v5] clocksource: Replace loop within
clocks_calc_mult_shift() with __fls() for calculation of "sftacc"
On Fri, Jun 13, 2025 at 11:52:39AM +0800, I Hsin Cheng wrote:
> Utilize "__fls()" in place of the while loop that counts
> down "sftacc". They're equivalent in computation result,
> but the former is more efficient.
>
> "__fls()" will return the bit number of the last set bit of
> "tmp", which is a 0-based index. Add 1 to convert it into the bit
> width as desired.
>
> Note that only the lowest 32 bits of "tmp" are taken into
> consideration: since "tmp" was already shifted right by 32 bits, the
> topmost 32 bits are guaranteed to be 0, and only the lowest 32 bits
> can be non-zero.
>
> This change is tested against a test script [1].
And because [1] is in the temporary section below the '---' line, it
won't make it into git history, so readers of the commit will have no
chance to follow it.
> Run the test 10 times for each version of the implementation and take
> the average. The results show that with this change, the operation
> overhead of "clocks_calc_mult_shift()" is reduced by around 99.7%.
>
> -----------------------------
> | old version | new version |
> -----------------------------
> | 11500.6 ns | 44 ns |
> -----------------------------
44 ns is more or less the minimal time delay that a typical x86_64
machine is able to measure. What you have measured on the 'new' side is
pretty likely a single timer tick, maybe two. Reliable time intervals
for performance measurements are within a few milliseconds.
> Signed-off-by: I Hsin Cheng <richard120310@...il.com>
> ---
> Changelog:
>
> v1 -> v2:
> - Refine commit message to explain more about "why"
> - Check the frequency at which "clocks_calc_mult_shift()" gets
> called; it's not in a hotpath on my machine, so refine the commit
> message to avoid overselling it
> - Add comments for the code to explain the implementation in
> more detail
> - Handle case for "tmp == 0" to avoid undefined behavior
>
> v2 -> v3:
> - Use "find_last_bit()" instead of "__builtin_clz()"
> - Convert the type of "tmp" to "const unsigned long *" when
> sending into the function
> - Highlight in the comment that only the lowest 32 bits part
> of "tmp" is taken into consideration
>
> v3 -> v4:
> - Use "__fls()" since "tmp" is of type u64, not cpumask
> - Refine commit messages to match the current implementation
>
> v4 -> v5:
> - Update commit header to mention the use of __fls()
>
> [1]:
> static int __init test_init(void)
> {
> 	u32 mult, shift;
> 	u32 from, to, maxsec;
> 	ktime_t start_time, end_time;
> 	s64 total_time;
>
> 	pr_info("Starting clocks_calc_mult_shift simple test\n");
>
> 	start_time = ktime_get();
> 	/* Test with parameters from 1 to 1000 */
> 	for (from = 1; from <= 1000; from += 100) {
> 		for (to = 1; to <= 1000; to += 100) {
> 			for (maxsec = 1; maxsec <= 10; maxsec++)
> 				clocks_calc_mult_shift(&mult, &shift, from, to, maxsec);
> 		}
> 	}
> 	end_time = ktime_get();
> 	total_time = ktime_to_ns(ktime_sub(end_time, start_time));
>
> 	pr_info("Test completed\n");
> 	pr_info("Total execution time: %lld ns\n", total_time);
> 	return 0;
> }
>
> The test runs in the form of a kernel module.
> The test machine runs Ubuntu 24.04 on x86_64 with kernel v6.14.0;
> the CPU is an AMD Ryzen 7 5700X3D 8-Core Processor.
x86 has an accelerated __fls(), so employing it broadly is an
unquestionable improvement. I don't see any benefit in posting snippets
of that sort. Does somebody doubt that one wins over the other?
The problem with clocks_calc_mult_shift() is that it open-codes an
existing helper, not that the open-coded version uses a naive shift
approach. It may not be a performance problem if it's not a hot path,
but it's for sure a sort of coding culture problem.
If you want to measure accelerated __fls() vs generic __fls() vs this
naive __fls() performance, it may be an interesting exercise, but it's
definitely out of the scope of a clocksource improvement.
If you want to do that, I'd suggest extending the find_bit_benchmark
test.
> Best regards,
> I Hsin Cheng.
> ---
> kernel/time/clocksource.c | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 2a7802ec480c..1e3dc68c696d 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -66,10 +66,19 @@ clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 maxsec)
>  	 * range:
>  	 */
>  	tmp = ((u64)maxsec * from) >> 32;
> -	while (tmp) {
> -		tmp >>=1;
> -		sftacc--;
> -	}
> +
> +	/*
> +	 * Decrement "sftacc" by the number of bits needed to represent "tmp".
> +	 * Using "__fls(tmp) + 1" to get the bit width:
> +	 * - __fls(tmp) returns the bit number of the last set bit
> +	 * - Plus 1 to convert 0-based index into bit width as desired
Please don't explain how, just explain why.
> +	 *
> +	 * Note: Only the lowest 32 bits of "tmp" are taken into consideration,
> +	 *	 since it was already shifted right by 32 bits, the topmost 32
> +	 *	 bits are guaranteed to be 0.
Then tmp should be u32, right?
I think the compiler would warn about an implicit 64-to-32 typecast for
32-bit targets, which would be a false positive in this case.
> +	 */
> +	if (tmp)
> +		sftacc -= __fls(tmp) + 1;
Suggested-by: Yury Norov [NVIDIA] <yury.norov@...dia.com> # for __fls()
>  
>  	/*
>  	 * Find the conversion shift/mult pair which has the best
> --
> 2.43.0