lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z6ynOre3V6FdYvkn@vaxr-BM6660-BM6360>
Date: Wed, 12 Feb 2025 21:50:50 +0800
From: I Hsin Cheng <richard120310@...il.com>
To: Yury Norov <yury.norov@...il.com>
Cc: anshuman.khandual@....com, arnd@...db.de, linux-kernel@...r.kernel.org,
	jserv@...s.ncku.edu.tw, skhan@...uxfoundation.org
Subject: Re: [PATCH 1/2] uapi: Refactor __GENMASK() for speed-up

On Tue, Feb 11, 2025 at 01:39:32PM -0500, Yury Norov wrote:
> On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote:
> > The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a
> > bitmask with "l" trailing zeroes, which is equivalent to
> > "(~_UL(0) << (l))".
> 
> I used to think that GENMASK() is a compile-time macro. __GENMASK() is
> not, but it has very limited usage through the kernel, all in the uapi.
>  
> > Refactor the calculation so the number of arithmetic instruction will be
> > reduced from 3 to 1.
> 
> I'd like to look at it! Please share disassembly.
> 
> > Signed-off-by: I Hsin Cheng <richard120310@...il.com>
> > ---
> > Test is done to show the speed-up we can get from reducing the number of
> > instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64
> > architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor.
> 
> So you CC arm maintainers and provide tests against x86_64. Are your
> improvements consistent for arm, power and other arches?
> 
> Can you run bloat-o-meter against your change?
> 
> > The test executes in the form of kernel module.
> > 
> > Test result:
> > [451675.644026] new __GENMASK() : 5985416
> > [451675.644030] old __GENMASK() : 6006406
> 
> The difference is 0.35%? That's impressive!
> 
> Can you please run your test multiple times and print those numbers
> together with confidence interval?
>  
> > Test script snippet:
> > /* switch to __BITS_PER_LONG_LONG when testing __GENMASK_ULL() */
> > \#define __GENMASK_NEW(h, l) \
> >         ((~_UL(0) << (l)) & \
> >          (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
> >
> > int init_module(void)
> > {
> >     ktime_t start, end, total1 = 0, total2 = 0;
> >     for (int k = 0; k < 100; k++) {
> >         for (int i = 0; i < __BITS_PER_LONG; i++) {
> >             for (int j = 0; j <= i; j++) {
> >                 unsigned result1, result2;
> > 
> >                 start = ktime_get();
> >                 result1 = __GENMASK_NEW(i, j);
> >                 end = ktime_get();
> >                 total1 += (end - start);
> 
> Nah, this is wrong. I feel like you measure performance of
> ktime_get() rather than GENMASK().
> 
> Do it this way:
> 
>         start = ktime_get();
>         for (...) {
>                 __always_used result = GENMASK();
>         }
>         time = ktime_get() - start;
> 
> 
> > 
> >                 start = ktime_get();
> >                 result2 = __GENMASK(i, j);
> 
> Please test GENMASK(), not __GENMASK().
> 
> >                 end = ktime_get();
> >                 total2 += (end - start);
> > 
> >                 if (result1 != result2) {
> >                     pr_info("Wrong calculation of GENMASK_NEW()\n");
> >                     return 0;
> >                 }
> 
> For functional testing we have lib/test_bits.c, lib/test_bitmap.c and
> others. I expect you'll run them all to check correctness. Once you do
> that, you don't need to test correctness here.
> 
> >             }
> >         }
> >     }
> > 
> >     pr_info("new __GENMASK() : %lld\n", total1);
> >     pr_info("old __GENMASK() : %lld\n", total2);
> > 
> >     return 0;
> > }
> 
> Please incorporate this in one of existing tests and prepend the
> series with it. That way you'll let the others to run before/after for
> your series.
> 
> > ---
> >  include/uapi/linux/bits.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/linux/bits.h b/include/uapi/linux/bits.h
> > index 5ee30f882736..8fc7fea65288 100644
> > --- a/include/uapi/linux/bits.h
> > +++ b/include/uapi/linux/bits.h
> > @@ -5,7 +5,7 @@
> >  #define _UAPI_LINUX_BITS_H
> >  
> >  #define __GENMASK(h, l) \
> > -        (((~_UL(0)) - (_UL(1) << (l)) + 1) & \
> > +	((~_UL(0) << (l)) & \
> >           (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
> 
> Now it would fit a single line, right?
> 
> If it works, this is a nice simplification, even though it's only a
> compile time thing.
> 
> Please address all the comments above and run low-level tests I
> mentioned above, so that we'll make sure it has no side effects.
> 
> Thanks,
> Yury
> 
> >  
> >  #define __GENMASK_ULL(h, l) \
> > -- 
> > 2.43.0


Hi Yury,

Thanks for your patience and detailed review!

> > Refactor the calculation so the number of arithmetic instruction will be
> > reduced from 3 to 1.
>
> I'd like to look at it! Please share disassembly.

What I am talking about here is current version has ("-", "<<", "+") as
operators and after the change there'll be only ("<<").
Though I don't think these 2 macro will have any differences in
assembly, because they're all evaluated to a number at compile time, I
can still test to see the result.

I wrote a function for simplicity and check the disassembly using objdump -d

void test_func() {
    int a = 10, b = 2;
    unsigned n = __GENMASK(a, b);
}

$objdump -d genmask_old
...
0000000000001129 <test_func>:
    1129:	f3 0f 1e fa          	endbr64
    112d:	55                   	push   %rbp
    112e:	48 89 e5             	mov    %rsp,%rbp
    1131:	c7 45 f4 0a 00 00 00 	movl   $0xa,-0xc(%rbp)
    1138:	c7 45 f8 02 00 00 00 	movl   $0x2,-0x8(%rbp)
    113f:	8b 45 f8             	mov    -0x8(%rbp),%eax
    1142:	ba 01 00 00 00       	mov    $0x1,%edx
    1147:	89 c1                	mov    %eax,%ecx
    1149:	48 d3 e2             	shl    %cl,%rdx
    114c:	48 89 d0             	mov    %rdx,%rax
    114f:	f7 d8                	neg    %eax
    1151:	89 c2                	mov    %eax,%edx
    1153:	b8 3f 00 00 00       	mov    $0x3f,%eax
    1158:	2b 45 f4             	sub    -0xc(%rbp),%eax
    115b:	48 c7 c6 ff ff ff ff 	mov    $0xffffffffffffffff,%rsi
    1162:	89 c1                	mov    %eax,%ecx
    1164:	48 d3 ee             	shr    %cl,%rsi
    1167:	48 89 f0             	mov    %rsi,%rax
    116a:	21 d0                	and    %edx,%eax
    116c:	89 45 fc             	mov    %eax,-0x4(%rbp)
    116f:	90                   	nop
    1170:	5d                   	pop    %rbp
    1171:	c3                   	ret
...

$objdump -d genmask_new
...
0000000000001129 <test_func>:
    1129:	f3 0f 1e fa          	endbr64
    112d:	55                   	push   %rbp
    112e:	48 89 e5             	mov    %rsp,%rbp
    1131:	c7 45 f4 0a 00 00 00 	movl   $0xa,-0xc(%rbp)
    1138:	c7 45 f8 02 00 00 00 	movl   $0x2,-0x8(%rbp)
    113f:	8b 45 f8             	mov    -0x8(%rbp),%eax
    1142:	48 c7 c2 ff ff ff ff 	mov    $0xffffffffffffffff,%rdx
    1149:	89 c1                	mov    %eax,%ecx
    114b:	48 d3 e2             	shl    %cl,%rdx
    114e:	48 89 d0             	mov    %rdx,%rax
    1151:	89 c2                	mov    %eax,%edx
    1153:	b8 3f 00 00 00       	mov    $0x3f,%eax
    1158:	2b 45 f4             	sub    -0xc(%rbp),%eax
    115b:	48 c7 c6 ff ff ff ff 	mov    $0xffffffffffffffff,%rsi
    1162:	89 c1                	mov    %eax,%ecx
    1164:	48 d3 ee             	shr    %cl,%rsi
    1167:	48 89 f0             	mov    %rsi,%rax
    116a:	21 d0                	and    %edx,%eax
    116c:	89 45 fc             	mov    %eax,-0x4(%rbp)
    116f:	90                   	nop
    1170:	5d                   	pop    %rbp
    1171:	c3                   	ret
...

> So you CC arm maintainers and provide tests against x86_64. Are your
> improvements consistent for arm, power and other arches?

I don't have available arm machines or others now, let me try to get them
and test.

> Nah, this is wrong. I feel like you measure performance of
> ktime_get() rather than GENMASK().
>
> Do it this way:

No problem, I rewrite the test like the following

    start = ktime_get();
    for (int k = 0; k < 1000; k++) {
        for (int i = 0; i < __BITS_PER_LONG_LONG; i++) {
            for (int j = 0; j <= i; j++) {
                unsigned result1;
                result1 = GENMASK(i, j);
            }
        }
    }
    end = ktime_get();

And here the result after running old version of GENMASK() and new
version of GENMASK(). Each running for 5 times.

[530775.667879] old GENMASK() : 60
[530811.687541] old GENMASK() : 60
[530824.527262] old GENMASK() : 50
[530826.583092] old GENMASK() : 40
[530831.487851] old GENMASK() : 50
[530893.618084] new GENMASK() : 50
[530897.734948] new GENMASK() : 50
[530900.253968] new GENMASK() : 50
[530902.415798] new GENMASK() : 40
[530904.136347] new GENMASK() : 50

On avgerage, old version of GENMASK() cost 52 time unit, new version of
GENMASK() costs 48 time unit.

> Please incorporate this in one of existing tests and prepend the
> series with it. That way you'll let the others to run before/after for
> your series.

> Now it would fit a single line, right?
>
> If it works, this is a nice simplification, even though it's only a
> compile time thing.
>
> Please address all the comments above and run low-level tests I
> mentioned above, so that we'll make sure it has no side effects.

Sure thing, I'll append the corresponding test into existing tests and
perform all the test again, though for different machine architecture
parts could be hard for me, I only have x86 machine now, I think I can
get an arm machine, but can't sure for other architectures.

I'll try to prepare everything as possible as I can and send v2. Thanks.

Best regards,
I Hsin Cheng.



Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ