[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z4tZDcCDt+U69kUF@visitorckw-System-Product-Name>
Date: Sat, 18 Jan 2025 15:32:29 +0800
From: Kuan-Wei Chiu <visitorckw@...il.com>
To: Yury Norov <yury.norov@...il.com>
Cc: I Hsin Cheng <richard120310@...il.com>, linux@...musvillemoes.dk,
jserv@...s.ncku.edu.tw, mark.rutland@....com,
linux-kernel@...r.kernel.org, eleanor15x@...il.com
Subject: Re: [PATCH] cpumask: Optimize cpumask_any_but()
Hi Yury,
On Fri, Jan 17, 2025 at 11:32:54AM -0500, Yury Norov wrote:
> On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> > On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > > The cpumask_any_but() function can avoid using a loop to determine the
> > > CPU index to return. If the first set bit in the cpumask is not equal
> > > to the specified CPU, we can directly return the index of the first set
> > > bit. Otherwise, we return the next set bit's index.
> > >
> > > This optimization replaces the loop with a single if statement,
> > > allowing the compiler to generate more concise and efficient code.
>
> I thought compilers are smart enough to unroll loop in this case. Can
> you show disassembled code before and after?
>
Since cpumask_any_but() is an inline function, I added the following to
lib/cpumask.c for convenience:
unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
{
return cpumask_any_but(mask, cpu);
}
I used objdump -d ./lib/cpumask.o to compare the differences.
* Before the patch:
00000000000001f0 <non_inline_cpumask_any_but>:
1f0: f3 0f 1e fa endbr64
1f4: 48 8b 3f mov (%rdi),%rdi
1f7: b8 40 00 00 00 mov $0x40,%eax
1fc: 48 85 ff test %rdi,%rdi
1ff: 74 4b je 24c <non_inline_cpumask_any_but+0x5c>
201: f3 48 0f bc d7 tzcnt %rdi,%rdx
206: 89 d0 mov %edx,%eax
208: 39 d6 cmp %edx,%esi
20a: 75 40 jne 24c <non_inline_cpumask_any_but+0x5c>
20c: 83 fa 3f cmp $0x3f,%edx
20f: 77 3b ja 24c <non_inline_cpumask_any_but+0x5c>
211: 41 b8 01 00 00 00 mov $0x1,%r8d
217: 83 c0 01 add $0x1,%eax
21a: 83 f8 40 cmp $0x40,%eax
21d: 74 2d je 24c <non_inline_cpumask_any_but+0x5c>
21f: 89 c1 mov %eax,%ecx
221: 4c 89 c2 mov %r8,%rdx
224: 48 d3 e2 shl %cl,%rdx
227: 48 89 d0 mov %rdx,%rax
22a: 48 f7 d8 neg %rax
22d: 48 21 f8 and %rdi,%rax
230: 74 15 je 247 <non_inline_cpumask_any_but+0x57>
232: f3 48 0f bc d0 tzcnt %rax,%rdx
237: 89 d0 mov %edx,%eax
239: 39 d6 cmp %edx,%esi
23b: 75 0f jne 24c <non_inline_cpumask_any_but+0x5c>
23d: 83 fa 3f cmp $0x3f,%edx
240: 76 d5 jbe 217 <non_inline_cpumask_any_but+0x27>
242: e9 00 00 00 00 jmp 247 <non_inline_cpumask_any_but+0x57>
247: b8 40 00 00 00 mov $0x40,%eax
24c: e9 00 00 00 00 jmp 251 <non_inline_cpumask_any_but+0x61>
* After the patch:
00000000000001f0 <non_inline_cpumask_any_but>:
1f0: f3 0f 1e fa endbr64
1f4: 48 8b 17 mov (%rdi),%rdx
1f7: 48 85 d2 test %rdx,%rdx
1fa: 74 34 je 230 <non_inline_cpumask_any_but+0x40>
1fc: f3 48 0f bc ca tzcnt %rdx,%rcx
201: 89 c8 mov %ecx,%eax
203: 39 ce cmp %ecx,%esi
205: 75 2e jne 235 <non_inline_cpumask_any_but+0x45>
207: 83 c1 01 add $0x1,%ecx
20a: 83 f9 3f cmp $0x3f,%ecx
20d: 77 21 ja 230 <non_inline_cpumask_any_but+0x40>
20f: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
216: 48 d3 e0 shl %cl,%rax
219: 48 89 c1 mov %rax,%rcx
21c: b8 40 00 00 00 mov $0x40,%eax
221: 48 21 d1 and %rdx,%rcx
224: 74 0f je 235 <non_inline_cpumask_any_but+0x45>
226: f3 48 0f bc c1 tzcnt %rcx,%rax
22b: e9 00 00 00 00 jmp 230 <non_inline_cpumask_any_but+0x40>
230: b8 40 00 00 00 mov $0x40,%eax
235: e9 00 00 00 00 jmp 23a <non_inline_cpumask_any_but+0x4a>
> > >
> > > As a result, the size of the bzImage built with x86 defconfig is
> > > reduced by 4096 bytes:
> > >
> > > * Before:
> > > $ size arch/x86/boot/bzImage
> > > text data bss dec hex filename
> > > 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
> > >
> > > * After:
> > > $ size arch/x86/boot/bzImage
> > > text data bss dec hex filename
> > > 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
>
> Comparing zipped images tells little about code generation. Please use
> scripts/bloat-o-meter.
>
$ ./scripts/bloat-o-meter ./old_cpumask.o ./new_cpumask.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-23 (-23)
Function old new delta
non_inline_cpumask_any_but 97 74 -23
Total: Before=522, After=499, chg -4.41%
> > >
> > > Co-developed-by: Yu-Chun Lin <eleanor15x@...il.com>
> > > Signed-off-by: Yu-Chun Lin <eleanor15x@...il.com>
> > > Signed-off-by: Kuan-Wei Chiu <visitorckw@...il.com>
> > > ---
> > > Not sure how to measure the efficiency difference, but I guess this
> > > patch might be slightly more efficient or nearly the same as before. If
> > > you have any good ideas for measuring efficiency, please let me know!
>
> Check lib/find_bit_benchmark.c
>
> > >
> > > include/linux/cpumask.h | 8 ++++----
> > > 1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > > index 9278a50d514f..b769fcdbaa10 100644
> > > --- a/include/linux/cpumask.h
> > > +++ b/include/linux/cpumask.h
> > > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> > > unsigned int i;
> > >
> > > cpumask_check(cpu);
> > > - for_each_cpu(i, mask)
> > > - if (i != cpu)
> > > - break;
> > > - return i;
> > > + i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
> >
> > Hi Kuan-Wei,
> >
> > How about using cpumask_first(mask) here to keep better consistency?
>
> I would do it the other way: introduce find_first_but_bit(), and then
> make cpumask_any_but() a wrapper around it. Doing this you'll be able
> to leverage find_bit_benchmark infrastructure to measure performance
> difference, if any.
>
I'll try to conduct this experiment.
Regards,
Kuan-Wei
> > > + if (i != cpu)
> > > + return i;
> > Wouldn't it benefit abit to check "i >= nr_cpu_ids" prior to
> > find_next_bit() ?
>
> Yes it would.
>
> Thanks,
> Yury
>
> > if "i >= nr_cpu_ids" holds it would be a waste to
> > perform find_next_bit().
> >
> > > + return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> > > }
> > >
> >
> > Regards,
> > I Hsin
> >
> > > /**
> > > --
> > > 2.34.1
> > >
Powered by blists - more mailing lists