Message-ID: <20250525123845.5b023297@pumpkin>
Date: Sun, 25 May 2025 12:38:45 +0100
From: David Laight <david.laight.linux@...il.com>
To: Nicolas Pitre <npitre@...libre.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
u.kleine-koenig@...libre.com, Oleg Nesterov <oleg@...hat.com>, Peter
Zijlstra <peterz@...radead.org>, Biju Das <biju.das.jz@...renesas.com>
Subject: Re: [PATCH v2 next 3/4] lib: Add mul_u64_add_u64_div_u64() and
mul_u64_u64_div_u64_roundup()

On Wed, 21 May 2025 09:50:28 -0400 (EDT)
Nicolas Pitre <npitre@...libre.com> wrote:
> On Wed, 21 May 2025, David Laight wrote:
>...
>
> Depends how costly the ilog2 is. On ARM the clz instruction is about 1
> cycle. If you need to figure out the MSB manually then it might be best
> to skip those ilog2's.

I was going to measure it, but instead I've pulled chunks from the kernel
headers into a userspace build so I can look at the generated code.

This is the x86-32 code for the 'if (ilog2(a) + ilog2(b) < 62)' test:

1b2b: 0f bd c6 bsr %esi,%eax
1b2e: 75 05 jne 1b35 <mul_u64_add_u64_div_u64_new+0x75>
1b30: b8 ff ff ff ff mov $0xffffffff,%eax
1b35: 85 c9 test %ecx,%ecx
1b37: 74 0d je 1b46 <mul_u64_add_u64_div_u64_new+0x86>
1b39: 0f bd c1 bsr %ecx,%eax
1b3c: 75 05 jne 1b43 <mul_u64_add_u64_div_u64_new+0x83>
1b3e: b8 ff ff ff ff mov $0xffffffff,%eax
1b43: 83 c0 20 add $0x20,%eax
1b46: 0f bd d5 bsr %ebp,%edx
1b49: 75 05 jne 1b50 <mul_u64_add_u64_div_u64_new+0x90>
1b4b: ba ff ff ff ff mov $0xffffffff,%edx
1b50: 85 ff test %edi,%edi
1b52: 74 0d je 1b61 <mul_u64_add_u64_div_u64_new+0xa1>
1b54: 0f bd d7 bsr %edi,%edx
1b57: 75 05 jne 1b5e <mul_u64_add_u64_div_u64_new+0x9e>
1b59: ba ff ff ff ff mov $0xffffffff,%edx
1b5e: 83 c2 20 add $0x20,%edx
1b61: 8d 1c 02 lea (%edx,%eax,1),%ebx
1b64: 83 fb 3e cmp $0x3e,%ebx
1b67: 0f 8e 0b 03 00 00 jle 1e78 <mul_u64_add_u64_div_u64_new+0x3b8>

If 'cmov' is enabled (it isn't by default, even after the current plan to
remove 486 support) it is:

1b2b: ba ff ff ff ff mov $0xffffffff,%edx
1b30: 85 c9 test %ecx,%ecx
1b32: 0f 85 98 03 00 00 jne 1ed0 <mul_u64_add_u64_div_u64_new+0x410>
1b38: 0f bd c6 bsr %esi,%eax
1b3b: 0f 44 c2 cmove %edx,%eax
1b3e: bb ff ff ff ff mov $0xffffffff,%ebx
1b43: 85 ff test %edi,%edi
1b45: 0f 85 75 03 00 00 jne 1ec0 <mul_u64_add_u64_div_u64_new+0x400>
1b4b: 0f bd d5 bsr %ebp,%edx
1b4e: 0f 44 d3 cmove %ebx,%edx
1b51: 8d 1c 02 lea (%edx,%eax,1),%ebx
1b54: 83 fb 3e cmp $0x3e,%ebx
1b57: 0f 8e 0b 03 00 00 jle 1e68 <mul_u64_add_u64_div_u64_new+0x3a8>

with:

1ec0: 0f bd d7 bsr %edi,%edx
1ec3: 0f 44 d3 cmove %ebx,%edx
1ec6: 83 c2 20 add $0x20,%edx
1ec9: e9 83 fc ff ff jmp 1b51 <mul_u64_add_u64_div_u64_new+0x91>
1ed0: 0f bd c1 bsr %ecx,%eax
1ed3: 0f 44 c2 cmove %edx,%eax
1ed6: 83 c0 20 add $0x20,%eax
1ed9: e9 60 fc ff ff jmp 1b3e <mul_u64_add_u64_div_u64_new+0x7e>

Neither is pretty...
Some of the 'damage' is because the x86 'bsr' (bit scan reverse) instruction
sets 'Z' for a zero input and leaves the output register undefined (although
in practice it is unchanged on later cpus), so the compiler has to wrap every
bsr in an explicit zero check.
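
Roughly what those bsr sequences are implementing - a userspace model of the
kernel's ilog2() on a 64-bit value in a 32-bit build, with the gcc builtin
standing in for fls() (names and layout are mine, not the kernel code):

	/*
	 * Each 32-bit half needs its own bsr plus a zero test because
	 * bsr leaves its destination undefined (and sets ZF) for a zero
	 * input; the 'mov $0xffffffff' above is the -1 that falls out
	 * for a zero word.
	 */
	static inline int ilog2_u64_sketch(unsigned long long v)
	{
		unsigned int hi = v >> 32, lo = v;

		if (hi)
			return 32 + (31 - __builtin_clz(hi));
		return lo ? 31 - __builtin_clz(lo) : -1;
	}
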
For reference I can get the multiply down to:
1b5d: 89 f0 mov %esi,%eax
1b5f: f7 e5 mul %ebp
1b61: 03 44 24 38 add 0x38(%esp),%eax
1b65: 83 d2 00 adc $0x0,%edx
1b68: 89 d3 mov %edx,%ebx
1b6a: 89 44 24 08 mov %eax,0x8(%esp)
1b6e: 89 f0 mov %esi,%eax
1b70: f7 e7 mul %edi
1b72: 03 44 24 3c add 0x3c(%esp),%eax
1b76: 83 d2 00 adc $0x0,%edx
1b79: 01 d8 add %ebx,%eax
1b7b: 83 d2 00 adc $0x0,%edx
1b7e: 89 d6 mov %edx,%esi
1b80: 89 c3 mov %eax,%ebx
1b82: 89 c8 mov %ecx,%eax
1b84: f7 e7 mul %edi
1b86: 89 c7 mov %eax,%edi
1b88: 89 c8 mov %ecx,%eax
1b8a: 01 f7 add %esi,%edi
1b8c: 83 d2 00 adc $0x0,%edx
1b8f: 89 d6 mov %edx,%esi
1b91: f7 e5 mul %ebp
1b93: 89 c1 mov %eax,%ecx
1b95: 8b 44 24 08 mov 0x8(%esp),%eax
1b99: 89 d5 mov %edx,%ebp
1b9b: 01 d9 add %ebx,%ecx
1b9d: 83 d5 00 adc $0x0,%ebp
1ba0: 89 44 24 28 mov %eax,0x28(%esp)
1ba4: 01 ef add %ebp,%edi
1ba6: 83 d6 00 adc $0x0,%esi
1ba9: 89 74 24 1c mov %esi,0x1c(%esp)
1bad: 8b 5c 24 1c mov 0x1c(%esp),%ebx
1bb1: 89 7c 24 18 mov %edi,0x18(%esp)
1bb5: 8b 44 24 18 mov 0x18(%esp),%eax
1bb9: 89 4c 24 2c mov %ecx,0x2c(%esp)
1bbd: 09 c3 or %eax,%ebx
1bbf: 0f 84 1b 03 00 00 je 1ee0 <mul_u64_add_u64_div_u64_new+0x420>

If you follow the register dependency chain it isn't as long as it looks.
(Although the last few instructions are terrible - the values get stored to
the stack and immediately reloaded - and I've tried a few things but they
won't go away.)
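
In C terms the block above is doing roughly the following 32-bit
decomposition of a * b + c into a 128-bit result (a sketch with my own names
and layout, not the actual patch code); the 'or/je' at the end is testing
whether the top 64 bits of that result are zero:

	#include <stdint.h>

	/*
	 * Four 32x32->64 partial products; the 64-bit addend is folded
	 * into the two low products so the carries only ripple through
	 * once.  None of the intermediate sums can wrap a uint64_t.
	 */
	static void mul_add_128_sketch(uint64_t a, uint64_t b, uint64_t c,
				       uint64_t *res_hi, uint64_t *res_lo)
	{
		uint32_t a_lo = a, a_hi = a >> 32;
		uint32_t b_lo = b, b_hi = b >> 32;

		uint64_t p_ll = (uint64_t)a_lo * b_lo + (uint32_t)c;
		uint64_t p_lh = (uint64_t)a_lo * b_hi + (uint32_t)(c >> 32)
				+ (p_ll >> 32);
		uint64_t p_hl = (uint64_t)a_hi * b_lo + (uint32_t)p_lh;
		uint64_t p_hh = (uint64_t)a_hi * b_hi + (p_lh >> 32)
				+ (p_hl >> 32);

		*res_lo = (p_hl << 32) | (uint32_t)p_ll;
		*res_hi = p_hh;
	}
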
David