linux-kernel - Re: [PATCH] cpumask: Optimize cpumask_any

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z4tZDcCDt+U69kUF@visitorckw-System-Product-Name>
Date: Sat, 18 Jan 2025 15:32:29 +0800
From: Kuan-Wei Chiu <visitorckw@...il.com>
To: Yury Norov <yury.norov@...il.com>
Cc: I Hsin Cheng <richard120310@...il.com>, linux@...musvillemoes.dk,
	jserv@...s.ncku.edu.tw, mark.rutland@....com,
	linux-kernel@...r.kernel.org, eleanor15x@...il.com
Subject: Re: [PATCH] cpumask: Optimize cpumask_any_but()

Hi Yury,

On Fri, Jan 17, 2025 at 11:32:54AM -0500, Yury Norov wrote:
> On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> > On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > > The cpumask_any_but() function can avoid using a loop to determine the
> > > CPU index to return. If the first set bit in the cpumask is not equal
> > > to the specified CPU, we can directly return the index of the first set
> > > bit. Otherwise, we return the next set bit's index.
> > > 
> > > This optimization replaces the loop with a single if statement,
> > > allowing the compiler to generate more concise and efficient code.
> 
> I thought compilers are smart enough to unroll loop in this case. Can
> you show disassembled code before and after?
> 
Since cpumask_any_but() is an inline function, I added the following to
lib/cpumask.c for convenience:

unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
{
	return cpumask_any_but(mask, cpu);
}

I used objdump -d ./lib/cpumask.o to compare the differences.

* Before the patch:

00000000000001f0 <non_inline_cpumask_any_but>:
 1f0:	f3 0f 1e fa          	endbr64 
 1f4:	48 8b 3f             	mov    (%rdi),%rdi
 1f7:	b8 40 00 00 00       	mov    $0x40,%eax
 1fc:	48 85 ff             	test   %rdi,%rdi
 1ff:	74 4b                	je     24c <non_inline_cpumask_any_but+0x5c>
 201:	f3 48 0f bc d7       	tzcnt  %rdi,%rdx
 206:	89 d0                	mov    %edx,%eax
 208:	39 d6                	cmp    %edx,%esi
 20a:	75 40                	jne    24c <non_inline_cpumask_any_but+0x5c>
 20c:	83 fa 3f             	cmp    $0x3f,%edx
 20f:	77 3b                	ja     24c <non_inline_cpumask_any_but+0x5c>
 211:	41 b8 01 00 00 00    	mov    $0x1,%r8d
 217:	83 c0 01             	add    $0x1,%eax
 21a:	83 f8 40             	cmp    $0x40,%eax
 21d:	74 2d                	je     24c <non_inline_cpumask_any_but+0x5c>
 21f:	89 c1                	mov    %eax,%ecx
 221:	4c 89 c2             	mov    %r8,%rdx
 224:	48 d3 e2             	shl    %cl,%rdx
 227:	48 89 d0             	mov    %rdx,%rax
 22a:	48 f7 d8             	neg    %rax
 22d:	48 21 f8             	and    %rdi,%rax
 230:	74 15                	je     247 <non_inline_cpumask_any_but+0x57>
 232:	f3 48 0f bc d0       	tzcnt  %rax,%rdx
 237:	89 d0                	mov    %edx,%eax
 239:	39 d6                	cmp    %edx,%esi
 23b:	75 0f                	jne    24c <non_inline_cpumask_any_but+0x5c>
 23d:	83 fa 3f             	cmp    $0x3f,%edx
 240:	76 d5                	jbe    217 <non_inline_cpumask_any_but+0x27>
 242:	e9 00 00 00 00       	jmp    247 <non_inline_cpumask_any_but+0x57>
 247:	b8 40 00 00 00       	mov    $0x40,%eax
 24c:	e9 00 00 00 00       	jmp    251 <non_inline_cpumask_any_but+0x61>

* After the patch:

00000000000001f0 <non_inline_cpumask_any_but>:
 1f0:	f3 0f 1e fa          	endbr64 
 1f4:	48 8b 17             	mov    (%rdi),%rdx
 1f7:	48 85 d2             	test   %rdx,%rdx
 1fa:	74 34                	je     230 <non_inline_cpumask_any_but+0x40>
 1fc:	f3 48 0f bc ca       	tzcnt  %rdx,%rcx
 201:	89 c8                	mov    %ecx,%eax
 203:	39 ce                	cmp    %ecx,%esi
 205:	75 2e                	jne    235 <non_inline_cpumask_any_but+0x45>
 207:	83 c1 01             	add    $0x1,%ecx
 20a:	83 f9 3f             	cmp    $0x3f,%ecx
 20d:	77 21                	ja     230 <non_inline_cpumask_any_but+0x40>
 20f:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
 216:	48 d3 e0             	shl    %cl,%rax
 219:	48 89 c1             	mov    %rax,%rcx
 21c:	b8 40 00 00 00       	mov    $0x40,%eax
 221:	48 21 d1             	and    %rdx,%rcx
 224:	74 0f                	je     235 <non_inline_cpumask_any_but+0x45>
 226:	f3 48 0f bc c1       	tzcnt  %rcx,%rax
 22b:	e9 00 00 00 00       	jmp    230 <non_inline_cpumask_any_but+0x40>
 230:	b8 40 00 00 00       	mov    $0x40,%eax
 235:	e9 00 00 00 00       	jmp    23a <non_inline_cpumask_any_but+0x4a>

> > > 
> > > As a result, the size of the bzImage built with x86 defconfig is
> > > reduced by 4096 bytes:
> > > 
> > > * Before:
> > > $ size arch/x86/boot/bzImage
> > >    text    data     bss     dec     hex filename
> > > 13537280           1024       0 13538304         ce9400 arch/x86/boot/bzImage
> > > 
> > > * After:
> > > $ size arch/x86/boot/bzImage
> > >    text    data     bss     dec     hex filename
> > > 13533184           1024       0 13534208         ce8400 arch/x86/boot/bzImage
> 
> Comparing zipped images tells little about code generation. Please use
> scripts/bloat-o-meter.
> 
$ ./scripts/bloat-o-meter ./old_cpumask.o ./new_cpumask.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-23 (-23)
Function                                     old     new   delta
non_inline_cpumask_any_but                    97      74     -23
Total: Before=522, After=499, chg -4.41%

> > > 
> > > Co-developed-by: Yu-Chun Lin <eleanor15x@...il.com>
> > > Signed-off-by: Yu-Chun Lin <eleanor15x@...il.com>
> > > Signed-off-by: Kuan-Wei Chiu <visitorckw@...il.com>
> > > ---
> > > Not sure how to measure the efficiency difference, but I guess this
> > > patch might be slightly more efficient or nearly the same as before. If
> > > you have any good ideas for measuring efficiency, please let me know!
> 
> Check lib/find_bit_benchmark.c
> 
> > > 
> > >  include/linux/cpumask.h | 8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > > index 9278a50d514f..b769fcdbaa10 100644
> > > --- a/include/linux/cpumask.h
> > > +++ b/include/linux/cpumask.h
> > > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> > >  	unsigned int i;
> > >  
> > >  	cpumask_check(cpu);
> > > -	for_each_cpu(i, mask)
> > > -		if (i != cpu)
> > > -			break;
> > > -	return i;
> > > +	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
> > 
> > Hi Kuan-Wei,
> > 
> > How about using cpumask_first(mask) here to keep better consistency?
> 
> I would do it the other way: introduce find_first_but_bit(), and then
> make cpumask_any_but() a wrapper around it. Doing this you'll be able
> to leverage find_bit_benchmark infrastructure to measure performance
> difference, if any.
>  
I'll try to conduct this experiment.

Regards,
Kuan-Wei

> > > +	if (i != cpu)
> > > +		return i;
> > Wouldn't it benefit abit to check "i >= nr_cpu_ids" prior to
> > find_next_bit() ?
> 
> Yes it would.
> 
> Thanks,
> Yury
> 
> > if "i >= nr_cpu_ids" holds it would be a waste to
> > perform find_next_bit().
> > 
> > > +	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> > >  }
> > >  
> > 
> > Regards,
> > I Hsin
> > 
> > >  /**
> > > -- 
> > > 2.34.1
> > >