[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z6uZZPpQ9YYfrL-I@thinkpad>
Date: Tue, 11 Feb 2025 13:39:32 -0500
From: Yury Norov <yury.norov@...il.com>
To: I Hsin Cheng <richard120310@...il.com>
Cc: anshuman.khandual@....com, arnd@...db.de, linux-kernel@...r.kernel.org,
jserv@...s.ncku.edu.tw, skhan@...uxfoundation.org
Subject: Re: [PATCH 1/2] uapi: Refactor __GENMASK() for speed-up
On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote:
> The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a
> bitmask with "l" trailing zeroes, which is equivalent to
> "(~_UL(0) << (l))".
I used to think that GENMASK() is a compile-time macro. __GENMASK() is
not, but it has very limited usage through the kernel, all in the uapi.
> Refactor the calculation so the number of arithmetic instruction will be
> reduced from 3 to 1.
I'd like to look at it! Please share disassembly.
> Signed-off-by: I Hsin Cheng <richard120310@...il.com>
> ---
> Test is done to show the speed-up we can get from reducing the number of
> instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64
> architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor.
So you CC arm maintainers and provide tests against x86_64. Are your
improvements consistent for arm, power and other arches?
Can you run bloat-o-meter against your change?
> The test executes in the form of kernel module.
>
> Test result:
> [451675.644026] new __GENMASK() : 5985416
> [451675.644030] old __GENMASK() : 6006406
The difference is 0.35%? That's impressive!
Can you please run your test multiple times and print those numbers
together with confidence interval?
> Test script snippet:
> /* switch to __BITS_PER_LONG_LONG when testing __GENMASK_ULL() */
> \#define __GENMASK_NEW(h, l) \
> ((~_UL(0) << (l)) & \
> (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
>
> int init_module(void)
> {
> ktime_t start, end, total1 = 0, total2 = 0;
> for (int k = 0; k < 100; k++) {
> for (int i = 0; i < __BITS_PER_LONG; i++) {
> for (int j = 0; j <= i; j++) {
> unsigned result1, result2;
>
> start = ktime_get();
> result1 = __GENMASK_NEW(i, j);
> end = ktime_get();
> total1 += (end - start);
Nah, this is wrong. I feel like you measure performance of
ktime_get() rather than GENMASK().
Do it this way:
start = ktime_get();
for (...) {
__always_used result = GENMASK();
}
time = ktime_get() - start;
>
> start = ktime_get();
> result2 = __GENMASK(i, j);
Please test GENMASK(), not __GENMASK().
> end = ktime_get();
> total2 += (end - start);
>
> if (result1 != result2) {
> pr_info("Wrong calculation of GENMASK_NEW()\n");
> return 0;
> }
For functional testing we have lib/test_bits.c, lib/test_bitmap.c and
others. I expect you'll run them all to check correctness. Once you do
that, you don't need to test correctness here.
> }
> }
> }
>
> pr_info("new __GENMASK() : %lld\n", total1);
> pr_info("old __GENMASK() : %lld\n", total2);
>
> return 0;
> }
Please incorporate this in one of existing tests and prepend the
series with it. That way you'll let the others to run before/after for
your series.
> ---
> include/uapi/linux/bits.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/bits.h b/include/uapi/linux/bits.h
> index 5ee30f882736..8fc7fea65288 100644
> --- a/include/uapi/linux/bits.h
> +++ b/include/uapi/linux/bits.h
> @@ -5,7 +5,7 @@
> #define _UAPI_LINUX_BITS_H
>
> #define __GENMASK(h, l) \
> - (((~_UL(0)) - (_UL(1) << (l)) + 1) & \
> + ((~_UL(0) << (l)) & \
> (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
Now it would fit a single line, right?
If it works, this is a nice simplification, even though it's only a
compile time thing.
Please address all the comments above and run low-level tests I
mentioned above, so that we'll make sure it has no side effects.
Thanks,
Yury
>
> #define __GENMASK_ULL(h, l) \
> --
> 2.43.0
Powered by blists - more mailing lists