linux-kernel - Re: [PATCH v2] uapi: Revert "bitops: avoid integer overflow in GENMASK(

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z7jM02O36MJBz8S_@thinkpad>
Date: Fri, 21 Feb 2025 13:58:27 -0500
From: Yury Norov <yury.norov@...il.com>
To: I Hsin Cheng <richard120310@...il.com>
Cc: anshuman.khandual@....com, arnd@...db.de, linux-kernel@...r.kernel.org,
	jserv@...s.ncku.edu.tw, skhan@...uxfoundation.org, mka@...omium.org,
	akpm@...ux-foundation.org, Kees Cook <kees@...nel.org>,
	Maxim Kuvyrkov <maxim.kuvyrkov@...aro.org>
Subject: Re: [PATCH v2] uapi: Revert "bitops: avoid integer overflow in
 GENMASK(_ULL)"

On Fri, Feb 21, 2025 at 03:41:49PM +0800, I Hsin Cheng wrote:
> This patch reverts 'commit c32ee3d9abd2("bitops: avoid integer overflow in
>  GENMASK(_ULL)")'.
> 
> The code generation can be shrink by over 1k by reverting the commit.
> Orginally the commit claims that clang will emit warnings using the
> implementation at that time.
> 
> Numerous tests are done to verify the code size shrink result and
> whether the warnings are emitted.
> 
> Signed-off-by: I Hsin Cheng <richard120310@...il.com>

This is the biggest email I've ever received in the maillist. :-)

> ---
> Tests performed on ubuntu 24.04 with AMD Ryzen 7 5700X3D 8-Core
> Processor on x86_64 with kernel version v6.14.0-rc1.
> 
> Test compilation with gcc-13, gcc-12, gcc-11, gcc cross compiler,
> clang-17, clang-18 and clang-19. No warnings are emitted for W=1 and
> W=2. However, with W=0, many errors and warnings will pop out.

W=0 means that you don't pass it at all. I don't know what the scripts
will do if you pass W=0...

> For gcc-13 with W=0:
> In file included from <command-line>:
> ./include/linux/rwsem.h: In function ‘rwsem_assert_held_write_nolockdep’:
> ././include/linux/compiler_types.h:477:20: error: ‘asm’ operand 2 probably does not match constraints [-Werror]
>   477 | #define asm_inline asm __inline
> ...
> 
> For clang-19 with W=0:
> ./arch/x86/include/asm/jump_label.h:36:11: error: invalid operand for inline asm constraint 'i'
>    36 |         asm goto(ARCH_STATIC_BRANCH_ASM("%c0 + %c1", "%l[l_yes]")
>       |                  ^
> ./arch/x86/include/asm/jump_label.h:26:2: note: expanded from macro 'ARCH_STATIC_BRANCH_ASM'
>    26 |         "1: jmp " label " # objtool NOPs this \n\t"     \
> ...
> 
> However, these errors already exists before applying the change, I don't
> think this patch has anything to do with these errors. I'll send other
> patches to fix these errors.

So if some error pre-exists, you don't need to mention it.

> The following section shows the result of code size shrinking for
> different compilers, architectures and NR_CPUS. Additionally, I use QEMU
> to run an arm64 VM and perform some cpu-heavy workload to make sure no
> side effect or crashes will happen.
> 
> * Code size shrinking ( gcc-13 for x86_64 NR_CPUS=64 ):
> 
> x$ ./scripts/bloat-o-meter vmlinux_old vmlinux_new
> add/remove: 0/2 grow/shrink: 46/510 up/down: 464/-1733 (-1269)

[...]

> Total: Before=22438085, After=22436816, chg -0.01%

Please don't include the whole bloat-o-meter output! 

> * Code size shrkinking ( gcc-12 for x86_64 NR_CPUS=64 ):
> 
> $ ./scripts/bloat-o-meter vmlinux_old vmlinux_new 
> add/remove: 4/4 grow/shrink: 57/500 up/down: 1026/-2028 (-1002)
> Total: Before=22453915, After=22452913, chg -0.00%
> 
> * Code size shrinking ( gcc-11 for x86_64 NR_CPUS=64 ):
> 
> $ ./scripts/bloat-o-meter vmlinux_old vmlinux_new 
> add/remove: 0/2 grow/shrink: 36/518 up/down: 564/-1771 (-1207)
> Total: Before=22302033, After=22300826, chg -0.01%
> 
> * Code size shrinking ( gcc-13 for x86_64 with NR_CPUS=1024 ):
> 
> $ ./scripts/bloat-o-meter vmlinux_old vmlinux_new
> add/remove: 0/0 grow/shrink: 9/132 up/down: 51/-503 (-452)
> Total: Before=22509812, After=22509487, chg -0.00%

Instead, can you summarize it in a nice looking table?

Compiler  NR_CPUS = 64   NR_CPUS = 1024   
gcc-13           -1269             -452
gcc-12           -1002              xxx
gcc-11           -1207              yyy

And so on.
 
> Every compiler, architectures combinations above can shrink the code
> size around 1k, except clang, which doesn't change anything before and
> after the change.
> 
> Analyzation of multiple functions of gcc-13 on x86_64 are performed,

s/Analyzation/Analysis

Can you run spell-checker too?

> trying to summarize what's the positive/negative effect of this change.
> 
> * free_acpi_perf_data
> >From bloat-o-meter, we can see the function has been shrink by 1 byte.
> ...
> free_acpi_perf_data                           75      74      -1
> ...
> 
> Disasembly it using
> $ objdump --disassemble=free_acpi_perf_data vmlinux_old > free_acpi_perf_data.old
> ...
> ffffffff81d632c3:	74 1a                	je     ffffffff81d632df <free_acpi_perf_data+0x3f>
> ffffffff81d632c5:	48 89 f8             	mov    %rdi,%rax
> ffffffff81d632c8:	48 d3 e0             	shl    %cl,%rax
> ffffffff81d632cb:	48 f7 d8             	neg    %rax
> ffffffff81d632ce:	48 21 d0             	and    %rdx,%rax
> ffffffff81d632d1:	74 0c                	je     ffffffff81d632df <free_acpi_perf_data+0x3f>
> ...
> 
> $ objdump --disassemble=free_acpi_perf_data vmlinux_new > free_acpi_perf_data.new
> ...
> ffffffff81d62ec5:	74 17                	je     ffffffff81d62ede <free_acpi_perf_data+0x3e>
> ffffffff81d62ec7:	48 89 fa             	mov    %rdi,%rdx
> ffffffff81d62eca:	48 d3 e2             	shl    %cl,%rdx
> ffffffff81d62ecd:	48 21 c2             	and    %rax,%rdx
> ffffffff81d62ed0:	74 0c                	je     ffffffff81d62ede <free_acpi_perf_data+0x3e>
> ...
> 
> Revert GENMASK() saves one negation here.
> 
> * tick_clock_notify
> >From bloat-o-meter, the shrink size is
> ...
> tick_clock_notify                            109     108      -1
> ...
> 
> $ objdump --disassemble=tick_clock_notify vmlinux_old > tick_clock_notify.old
> ...
> ffffffff81360694:	48 8b 05 05 fc 4d 01 	mov    0x14dfc05(%rip),%rax        # ffffffff828402a0 <__cpu_possible_mask>
> ffffffff8136069b:	48 85 c0             	test   %rax,%rax
> ffffffff8136069e:	74 58                	je     ffffffff813606f8 <tick_clock_notify+0x68>
> ffffffff813606a0:	f3 48 0f bc c0       	tzcnt  %rax,%rax
> ffffffff813606a5:	89 c1                	mov    %eax,%ecx
> ffffffff813606a7:	83 f8 3f             	cmp    $0x3f,%eax
> ffffffff813606aa:	77 4c                	ja     ffffffff813606f8 <tick_clock_notify+0x68>
> ffffffff813606ac:	be 01 00 00 00       	mov    $0x1,%esi
> ffffffff813606b1:	48 c7 c2 c0 0a 02 00 	mov    $0x20ac0,%rdx
> ffffffff813606b8:	48 63 c1             	movslq %ecx,%rax
> ffffffff813606bb:	48 8b 3c c5 a0 fb 83 	mov    -0x7d7c0460(,%rax,8),%rdi
> ffffffff813606c2:	82 
> ffffffff813606c3:	48 01 d7             	add    %rdx,%rdi
> ffffffff813606c6:	f0 80 8f e0 00 00 00 	lock orb $0x1,0xe0(%rdi)
> ffffffff813606cd:	01 
> ffffffff813606ce:	83 c1 01             	add    $0x1,%ecx
> ffffffff813606d1:	48 63 c1             	movslq %ecx,%rax
> ffffffff813606d4:	48 83 f8 3f          	cmp    $0x3f,%rax
> ffffffff813606d8:	77 1e                	ja     ffffffff813606f8 <tick_clock_notify+0x68>
> ffffffff813606da:	48 89 f0             	mov    %rsi,%rax
> ffffffff813606dd:	48 d3 e0             	shl    %cl,%rax
> ffffffff813606e0:	48 f7 d8             	neg    %rax
> ffffffff813606e3:	48 23 05 b6 fb 4d 01 	and    0x14dfbb6(%rip),%rax        # ffffffff828402a0 <__cpu_possible_mask>
> ...
> 
> $ objdump --disassemble=tick_clock_notify vmlinux_new > tick_clock_notify.new
> ...
> ffffffff81360564:	48 8b 05 35 fd 4d 01 	mov    0x14dfd35(%rip),%rax        # ffffffff828402a0 <__cpu_possible_mask>
> ffffffff8136056b:	48 85 c0             	test   %rax,%rax
> ffffffff8136056e:	74 57                	je     ffffffff813605c7 <tick_clock_notify+0x67>
> ffffffff81360570:	f3 48 0f bc c0       	tzcnt  %rax,%rax
> ffffffff81360575:	89 c1                	mov    %eax,%ecx
> ffffffff81360577:	83 f8 3f             	cmp    $0x3f,%eax
> ffffffff8136057a:	77 4b                	ja     ffffffff813605c7 <tick_clock_notify+0x67>
> ffffffff8136057c:	48 c7 c6 ff ff ff ff 	mov    $0xffffffffffffffff,%rsi
> ffffffff81360583:	48 c7 c2 c0 0a 02 00 	mov    $0x20ac0,%rdx
> ffffffff8136058a:	48 63 c1             	movslq %ecx,%rax
> ffffffff8136058d:	48 8b 3c c5 a0 fb 83 	mov    -0x7d7c0460(,%rax,8),%rdi
> ffffffff81360594:	82 
> ffffffff81360595:	48 01 d7             	add    %rdx,%rdi
> ffffffff81360598:	f0 80 8f e0 00 00 00 	lock orb $0x1,0xe0(%rdi)
> ffffffff8136059f:	01 
> ffffffff813605a0:	83 c1 01             	add    $0x1,%ecx
> ffffffff813605a3:	48 63 c1             	movslq %ecx,%rax
> ffffffff813605a6:	48 83 f8 3f          	cmp    $0x3f,%rax
> ffffffff813605aa:	77 1b                	ja     ffffffff813605c7 <tick_clock_notify+0x67>
> ffffffff813605ac:	48 89 f0             	mov    %rsi,%rax
> ffffffff813605af:	48 d3 e0             	shl    %cl,%rax
> ffffffff813605b2:	48 23 05 e7 fc 4d 01 	and    0x14dfce7(%rip),%rax        # ffffffff828402a0 <__cpu_possible_mask>
> ...
> 
> One negation is saved here, I also try to reproduce it on godbolt, the
> link is https://godbolt.org/z/ac4h1ov7f .
> 
> For negative impact, we pick the following function
> 
> * kstat_irqs_desc
> ...
> kstat_irqs_desc                              118     122      +4
> ...
> 
> $ objdump --disassemble=kstat_irqs_desc vmlinux_old > kstat_irqs_desc.old
> ...
> Disassembly of section .text:
> 
> ffffffff8130c320 <kstat_irqs_desc>:
> ffffffff8130c320:	f3 0f 1e fa          	endbr64
> ffffffff8130c324:	f7 47 78 00 02 02 00 	testl  $0x20200,0x78(%rdi)
> ffffffff8130c32b:	74 5b                	je     ffffffff8130c388 <kstat_irqs_desc+0x68>
> ffffffff8130c32d:	48 8b 36             	mov    (%rsi),%rsi
> ffffffff8130c330:	31 d2                	xor    %edx,%edx
> ffffffff8130c332:	48 85 f6             	test   %rsi,%rsi
> ffffffff8130c335:	74 4a                	je     ffffffff8130c381 <kstat_irqs_desc+0x61>
> ffffffff8130c337:	f3 48 0f bc c6       	tzcnt  %rsi,%rax
> ffffffff8130c33c:	89 c1                	mov    %eax,%ecx
> ffffffff8130c33e:	83 f8 3f             	cmp    $0x3f,%eax
> ffffffff8130c341:	77 3e                	ja     ffffffff8130c381 <kstat_irqs_desc+0x61>
> ffffffff8130c343:	41 b8 01 00 00 00    	mov    $0x1,%r8d
> ffffffff8130c349:	48 8b 7f 60          	mov    0x60(%rdi),%rdi
> ffffffff8130c34d:	48 63 c1             	movslq %ecx,%rax
> ffffffff8130c350:	83 c1 01             	add    $0x1,%ecx
> ffffffff8130c353:	48 8b 04 c5 a0 fb 83 	mov    -0x7d7c0460(,%rax,8),%rax
> ffffffff8130c35a:	82 
> ffffffff8130c35b:	03 14 38             	add    (%rax,%rdi,1),%edx
> ffffffff8130c35e:	48 63 c1             	movslq %ecx,%rax
> ffffffff8130c361:	48 83 f8 3f          	cmp    $0x3f,%rax
> ffffffff8130c365:	77 1a                	ja     ffffffff8130c381 <kstat_irqs_desc+0x61>
> ffffffff8130c367:	4c 89 c0             	mov    %r8,%rax
> ffffffff8130c36a:	48 d3 e0             	shl    %cl,%rax
> ffffffff8130c36d:	48 f7 d8             	neg    %rax
> ffffffff8130c370:	48 21 f0             	and    %rsi,%rax
> ffffffff8130c373:	74 0c                	je     ffffffff8130c381 <kstat_irqs_desc+0x61>
> ffffffff8130c375:	f3 48 0f bc c0       	tzcnt  %rax,%rax
> ffffffff8130c37a:	89 c1                	mov    %eax,%ecx
> ffffffff8130c37c:	83 f8 3f             	cmp    $0x3f,%eax
> ffffffff8130c37f:	76 cc                	jbe    ffffffff8130c34d <kstat_irqs_desc+0x2d>
> ffffffff8130c381:	89 d0                	mov    %edx,%eax
> ffffffff8130c383:	e9 58 93 e8 00       	jmp    ffffffff821956e0 <__x86_return_thunk>
> ffffffff8130c388:	f6 47 7d 20          	testb  $0x20,0x7d(%rdi)
> ffffffff8130c38c:	75 9f                	jne    ffffffff8130c32d <kstat_irqs_desc+0xd>
> ffffffff8130c38e:	8b 97 88 00 00 00    	mov    0x88(%rdi),%edx
> ffffffff8130c394:	eb eb                	jmp    ffffffff8130c381 <kstat_irqs_desc+0x61>
> ...
> 
> $ objdump --disassemble=kstat_irqs_desc vmlinux_new >
> kstat_irqs_desc.new
> ...
> Disassembly of section .text:
> 
> ffffffff8130c240 <kstat_irqs_desc>:
> ffffffff8130c240:	f3 0f 1e fa          	endbr64
> ffffffff8130c244:	f7 47 78 00 02 02 00 	testl  $0x20200,0x78(%rdi)
> ffffffff8130c24b:	74 57                	je     ffffffff8130c2a4 <kstat_irqs_desc+0x64>
> ffffffff8130c24d:	48 8b 36             	mov    (%rsi),%rsi
> ffffffff8130c250:	31 c0                	xor    %eax,%eax
> ffffffff8130c252:	48 85 f6             	test   %rsi,%rsi
> ffffffff8130c255:	74 48                	je     ffffffff8130c29f <kstat_irqs_desc+0x5f>
> ffffffff8130c257:	f3 48 0f bc d6       	tzcnt  %rsi,%rdx
> ffffffff8130c25c:	89 d1                	mov    %edx,%ecx
> ffffffff8130c25e:	83 fa 3f             	cmp    $0x3f,%edx
> ffffffff8130c261:	77 52                	ja     ffffffff8130c2b5 <kstat_irqs_desc+0x75>
> ffffffff8130c263:	49 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%r8
> ffffffff8130c26a:	48 8b 7f 60          	mov    0x60(%rdi),%rdi
> ffffffff8130c26e:	48 63 d1             	movslq %ecx,%rdx
> ffffffff8130c271:	83 c1 01             	add    $0x1,%ecx
> ffffffff8130c274:	48 8b 14 d5 a0 fb 83 	mov    -0x7d7c0460(,%rdx,8),%rdx
> ffffffff8130c27b:	82 
> ffffffff8130c27c:	03 04 3a             	add    (%rdx,%rdi,1),%eax
> ffffffff8130c27f:	48 63 d1             	movslq %ecx,%rdx
> ffffffff8130c282:	48 83 fa 3f          	cmp    $0x3f,%rdx
> ffffffff8130c286:	77 17                	ja     ffffffff8130c29f <kstat_irqs_desc+0x5f>
> ffffffff8130c288:	4c 89 c2             	mov    %r8,%rdx
> ffffffff8130c28b:	48 d3 e2             	shl    %cl,%rdx
> ffffffff8130c28e:	48 21 f2             	and    %rsi,%rdx
> ffffffff8130c291:	74 0c                	je     ffffffff8130c29f <kstat_irqs_desc+0x5f>
> ffffffff8130c293:	f3 48 0f bc d2       	tzcnt  %rdx,%rdx
> ffffffff8130c298:	89 d1                	mov    %edx,%ecx
> ffffffff8130c29a:	83 fa 3f             	cmp    $0x3f,%edx
> ffffffff8130c29d:	76 cf                	jbe    ffffffff8130c26e <kstat_irqs_desc+0x2e>
> ffffffff8130c29f:	e9 3c 94 e8 00       	jmp    ffffffff821956e0 <__x86_return_thunk>
> ffffffff8130c2a4:	f6 47 7d 20          	testb  $0x20,0x7d(%rdi)
> ffffffff8130c2a8:	75 a3                	jne    ffffffff8130c24d <kstat_irqs_desc+0xd>
> ffffffff8130c2aa:	8b 87 88 00 00 00    	mov    0x88(%rdi),%eax
> ffffffff8130c2b0:	e9 2b 94 e8 00       	jmp    ffffffff821956e0 <__x86_return_thunk>
> ffffffff8130c2b5:	e9 26 94 e8 00       	jmp    ffffffff821956e0 <__x86_return_thunk>

Alright, two identical jumps one after another. Ohh. Anyways,
GENMASK() doesn't seem to be involved, and the code generation is
overall improved.

> ...
> 
> We can find that GENMASK() indeed saves 1 negation and one "mov", but at
> thee same time it generate one more "jmp" and make another "jmp"

s/thee/the

> instruction become "e9 2b 94 e8 00" instead of "eb eb".
> 
> * nr_processes
> $ objdump --disassemble=nr_processes vmlinux_old > nr_processes.old
> ...
> ffffffff812873c3:	77 1a                	ja     ffffffff812873df <nr_processes+0x5f>
> ffffffff812873c5:	4c 89 c0             	mov    %r8,%rax
> ffffffff812873c8:	48 d3 e0             	shl    %cl,%rax
> ffffffff812873cb:	48 f7 d8             	neg    %rax
> ffffffff812873ce:	48 21 d0             	and    %rdx,%rax
> ffffffff812873d1:	74 0c                	je     ffffffff812873df <nr_processes+0x5f>
> ffffffff812873d3:	f3 48 0f bc c0       	tzcnt  %rax,%rax
> ffffffff812873d8:	89 c1                	mov    %eax,%ecx
> ffffffff812873da:	83 f8 3f             	cmp    $0x3f,%eax
> ffffffff812873dd:	76 cc                	jbe    ffffffff812873ab <nr_processes+0x2b>
> ffffffff812873df:	89 f0                	mov    %esi,%eax
> ffffffff812873e1:	e9 fa e2 f0 00       	jmp    ffffffff821956e0 <__x86_return_thunk>
> ...
> 
> $ objdump --disassemble=nr_processes vmlinux_new > nr_processes.new
> ...
> ffffffff812873c4:	77 1a                	ja     ffffffff812873e0 <nr_processes+0x60>
> ffffffff812873c6:	4c 89 c6             	mov    %r8,%rsi
> ffffffff812873c9:	48 d3 e6             	shl    %cl,%rsi
> ffffffff812873cc:	48 89 f1             	mov    %rsi,%rcx
> ffffffff812873cf:	48 21 c1             	and    %rax,%rcx
> ffffffff812873d2:	74 0c                	je     ffffffff812873e0 <nr_processes+0x60>
> ffffffff812873d4:	f3 48 0f bc f1       	tzcnt  %rcx,%rsi
> ffffffff812873d9:	89 f1                	mov    %esi,%ecx
> ffffffff812873db:	83 fe 3f             	cmp    $0x3f,%esi
> ffffffff812873de:	76 cc                	jbe    ffffffff812873ac <nr_processes+0x2c>
> ffffffff812873e0:	89 d0                	mov    %edx,%eax
> ffffffff812873e2:	e9 f9 e2 f0 00       	jmp    ffffffff821956e0 <__x86_return_thunk>
> ...
> 
> GENMASK() can save 1 negation but generate 1 more "mov" thus adds 1 more
> register to use.
> 
> In summary, GENMASK() can elimate 1 negation for almost every use cases,

I think only this summary is important. If you'd like to add a typical
example of code generation improvement, it's OK. But I generally trust
you, so don't put every function you've disassembled here.

> but some use cases may generate additional "mov" or register in use.
> The use of "~_UL(0) << (l)" may even result in use of %r* registers instead of
> using $e* which is 32 bits registers, because compiler can't make
> assumption that the higher bits are zeroes. ( I'm not super sure whether
> it's the true cause, let me know if anything needs to be corrected or
> need more tests, thanks. )

To me, it looks OK. GCC does weird things, but the code generation
looks better now. Adding Kees and Maxim as compiler experts. Guys,
if you have a minute, can you take a quick look at the code generation?
In case of red flags, please let me know.

Thanks,
Yury

> v1 -> v2:
> 	- Refer the patch explicitly as a revert of commit c32ee3d
> 	- Squash commits into 1 commit
> 	- Perform compilation for numerous compilers, architectures and
> 	  compare code size shrink for each of them
> 	- Perform cpu-heavy workload test inside x86_64 VM and ARM64 VM
> 	- Analyze the result of disassembly of several small use cases
> 
> ---
>  include/uapi/linux/bits.h | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/bits.h b/include/uapi/linux/bits.h
> index 5ee30f882736..39e344a1227e 100644
> --- a/include/uapi/linux/bits.h
> +++ b/include/uapi/linux/bits.h
> @@ -5,12 +5,10 @@
>  #define _UAPI_LINUX_BITS_H
>  
>  #define __GENMASK(h, l) \
> -        (((~_UL(0)) - (_UL(1) << (l)) + 1) & \
> -         (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
> +(((~_UL(0)) << (l)) & (~_UL(0) >> (BITS_PER_LONG - 1 - (h))))
>  
>  #define __GENMASK_ULL(h, l) \
> -        (((~_ULL(0)) - (_ULL(1) << (l)) + 1) & \
> -         (~_ULL(0) >> (__BITS_PER_LONG_LONG - 1 - (h))))
> +(((~_ULL(0)) << (l)) & (~_ULL(0) >> (BITS_PER_LONG_LONG - 1 - (h))))
>  
>  #define __GENMASK_U128(h, l) \
>  	((_BIT128((h)) << 1) - (_BIT128(l)))
> -- 
> 2.43.0