[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z6ynOre3V6FdYvkn@vaxr-BM6660-BM6360>
Date: Wed, 12 Feb 2025 21:50:50 +0800
From: I Hsin Cheng <richard120310@...il.com>
To: Yury Norov <yury.norov@...il.com>
Cc: anshuman.khandual@....com, arnd@...db.de, linux-kernel@...r.kernel.org,
jserv@...s.ncku.edu.tw, skhan@...uxfoundation.org
Subject: Re: [PATCH 1/2] uapi: Refactor __GENMASK() for speed-up
On Tue, Feb 11, 2025 at 01:39:32PM -0500, Yury Norov wrote:
> On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote:
> > The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a
> > bitmask with "l" trailing zeroes, which is equivalent to
> > "(~_UL(0) << (l))".
>
> I used to think that GENMASK() is a compile-time macro. __GENMASK() is
> not, but it has very limited usage through the kernel, all in the uapi.
>
> > Refactor the calculation so the number of arithmetic instruction will be
> > reduced from 3 to 1.
>
> I'd like to look at it! Please share disassembly.
>
> > Signed-off-by: I Hsin Cheng <richard120310@...il.com>
> > ---
> > Test is done to show the speed-up we can get from reducing the number of
> > instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64
> > architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor.
>
> So you CC arm maintainers and provide tests against x86_64. Are your
> improvements consistent for arm, power and other arches?
>
> Can you run bloat-o-meter against your change?
>
> > The test executes in the form of kernel module.
> >
> > Test result:
> > [451675.644026] new __GENMASK() : 5985416
> > [451675.644030] old __GENMASK() : 6006406
>
> The difference is 0.35%? That's impressive!
>
> Can you please run your test multiple times and print those numbers
> together with confidence interval?
>
> > Test script snippet:
> > /* switch to __BITS_PER_LONG_LONG when testing __GENMASK_ULL() */
> > \#define __GENMASK_NEW(h, l) \
> > ((~_UL(0) << (l)) & \
> > (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
> >
> > int init_module(void)
> > {
> > ktime_t start, end, total1 = 0, total2 = 0;
> > for (int k = 0; k < 100; k++) {
> > for (int i = 0; i < __BITS_PER_LONG; i++) {
> > for (int j = 0; j <= i; j++) {
> > unsigned result1, result2;
> >
> > start = ktime_get();
> > result1 = __GENMASK_NEW(i, j);
> > end = ktime_get();
> > total1 += (end - start);
>
> Nah, this is wrong. I feel like you measure performance of
> ktime_get() rather than GENMASK().
>
> Do it this way:
>
> start = ktime_get();
> for (...) {
> __always_used result = GENMASK();
> }
> time = ktime_get() - start;
>
>
> >
> > start = ktime_get();
> > result2 = __GENMASK(i, j);
>
> Please test GENMASK(), not __GENMASK().
>
> > end = ktime_get();
> > total2 += (end - start);
> >
> > if (result1 != result2) {
> > pr_info("Wrong calculation of GENMASK_NEW()\n");
> > return 0;
> > }
>
> For functional testing we have lib/test_bits.c, lib/test_bitmap.c and
> others. I expect you'll run them all to check correctness. Once you do
> that, you don't need to test correctness here.
>
> > }
> > }
> > }
> >
> > pr_info("new __GENMASK() : %lld\n", total1);
> > pr_info("old __GENMASK() : %lld\n", total2);
> >
> > return 0;
> > }
>
> Please incorporate this in one of existing tests and prepend the
> series with it. That way you'll let the others to run before/after for
> your series.
>
> > ---
> > include/uapi/linux/bits.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/bits.h b/include/uapi/linux/bits.h
> > index 5ee30f882736..8fc7fea65288 100644
> > --- a/include/uapi/linux/bits.h
> > +++ b/include/uapi/linux/bits.h
> > @@ -5,7 +5,7 @@
> > #define _UAPI_LINUX_BITS_H
> >
> > #define __GENMASK(h, l) \
> > - (((~_UL(0)) - (_UL(1) << (l)) + 1) & \
> > + ((~_UL(0) << (l)) & \
> > (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
>
> Now it would fit a single line, right?
>
> If it works, this is a nice simplification, even though it's only a
> compile time thing.
>
> Please address all the comments above and run low-level tests I
> mentioned above, so that we'll make sure it has no side effects.
>
> Thanks,
> Yury
>
> >
> > #define __GENMASK_ULL(h, l) \
> > --
> > 2.43.0
Hi Yury,
Thanks for your patience and detailed review!
> > Refactor the calculation so the number of arithmetic instruction will be
> > reduced from 3 to 1.
>
> I'd like to look at it! Please share disassembly.
What I am talking about here is current version has ("-", "<<", "+") as
operators and after the change there'll be only ("<<").
Though I don't think these 2 macro will have any differences in
assembly, because they're all evaluated to a number at compile time, I
can still test to see the result.
I wrote a function for simplicity and check the disassembly using objdump -d
void test_func() {
int a = 10, b = 2;
unsigned n = __GENMASK(a, b);
}
$objdump -d genmask_old
...
0000000000001129 <test_func>:
1129: f3 0f 1e fa endbr64
112d: 55 push %rbp
112e: 48 89 e5 mov %rsp,%rbp
1131: c7 45 f4 0a 00 00 00 movl $0xa,-0xc(%rbp)
1138: c7 45 f8 02 00 00 00 movl $0x2,-0x8(%rbp)
113f: 8b 45 f8 mov -0x8(%rbp),%eax
1142: ba 01 00 00 00 mov $0x1,%edx
1147: 89 c1 mov %eax,%ecx
1149: 48 d3 e2 shl %cl,%rdx
114c: 48 89 d0 mov %rdx,%rax
114f: f7 d8 neg %eax
1151: 89 c2 mov %eax,%edx
1153: b8 3f 00 00 00 mov $0x3f,%eax
1158: 2b 45 f4 sub -0xc(%rbp),%eax
115b: 48 c7 c6 ff ff ff ff mov $0xffffffffffffffff,%rsi
1162: 89 c1 mov %eax,%ecx
1164: 48 d3 ee shr %cl,%rsi
1167: 48 89 f0 mov %rsi,%rax
116a: 21 d0 and %edx,%eax
116c: 89 45 fc mov %eax,-0x4(%rbp)
116f: 90 nop
1170: 5d pop %rbp
1171: c3 ret
...
$objdump -d genmask_new
...
0000000000001129 <test_func>:
1129: f3 0f 1e fa endbr64
112d: 55 push %rbp
112e: 48 89 e5 mov %rsp,%rbp
1131: c7 45 f4 0a 00 00 00 movl $0xa,-0xc(%rbp)
1138: c7 45 f8 02 00 00 00 movl $0x2,-0x8(%rbp)
113f: 8b 45 f8 mov -0x8(%rbp),%eax
1142: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
1149: 89 c1 mov %eax,%ecx
114b: 48 d3 e2 shl %cl,%rdx
114e: 48 89 d0 mov %rdx,%rax
1151: 89 c2 mov %eax,%edx
1153: b8 3f 00 00 00 mov $0x3f,%eax
1158: 2b 45 f4 sub -0xc(%rbp),%eax
115b: 48 c7 c6 ff ff ff ff mov $0xffffffffffffffff,%rsi
1162: 89 c1 mov %eax,%ecx
1164: 48 d3 ee shr %cl,%rsi
1167: 48 89 f0 mov %rsi,%rax
116a: 21 d0 and %edx,%eax
116c: 89 45 fc mov %eax,-0x4(%rbp)
116f: 90 nop
1170: 5d pop %rbp
1171: c3 ret
...
> So you CC arm maintainers and provide tests against x86_64. Are your
> improvements consistent for arm, power and other arches?
I don't have available arm machines or others now, let me try to get them
and test.
> Nah, this is wrong. I feel like you measure performance of
> ktime_get() rather than GENMASK().
>
> Do it this way:
No problem, I rewrite the test like the following
start = ktime_get();
for (int k = 0; k < 1000; k++) {
for (int i = 0; i < __BITS_PER_LONG_LONG; i++) {
for (int j = 0; j <= i; j++) {
unsigned result1;
result1 = GENMASK(i, j);
}
}
}
end = ktime_get();
And here the result after running old version of GENMASK() and new
version of GENMASK(). Each running for 5 times.
[530775.667879] old GENMASK() : 60
[530811.687541] old GENMASK() : 60
[530824.527262] old GENMASK() : 50
[530826.583092] old GENMASK() : 40
[530831.487851] old GENMASK() : 50
[530893.618084] new GENMASK() : 50
[530897.734948] new GENMASK() : 50
[530900.253968] new GENMASK() : 50
[530902.415798] new GENMASK() : 40
[530904.136347] new GENMASK() : 50
On avgerage, old version of GENMASK() cost 52 time unit, new version of
GENMASK() costs 48 time unit.
> Please incorporate this in one of existing tests and prepend the
> series with it. That way you'll let the others to run before/after for
> your series.
> Now it would fit a single line, right?
>
> If it works, this is a nice simplification, even though it's only a
> compile time thing.
>
> Please address all the comments above and run low-level tests I
> mentioned above, so that we'll make sure it has no side effects.
Sure thing, I'll append the corresponding test into existing tests and
perform all the test again, though for different machine architecture
parts could be hard for me, I only have x86 machine now, I think I can
get an arm machine, but can't sure for other architectures.
I'll try to prepare everything as possible as I can and send v2. Thanks.
Best regards,
I Hsin Cheng.
Powered by blists - more mailing lists