linux-kernel - [PATCH] mm: memory: Force-inline PTE/PMD zapping functions for performance


Message-Id: <20250805120435.1142283-1-liqiang01@kylinos.cn>
Date: Tue,  5 Aug 2025 20:04:35 +0800
From: Li Qiang <liqiang01@...inos.cn>
To: akpm@...ux-foundation.org,
	david@...hat.com
Cc: linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	lorenzo.stoakes@...cle.com,
	Liam.Howlett@...cle.com,
	vbabka@...e.cz,
	rppt@...nel.org,
	surenb@...gle.com,
	mhocko@...e.com
Subject: [PATCH] mm: memory: Force-inline PTE/PMD zapping functions for performance

> Ah, missed it after the performance numbers. As Vlastimil mentioned, I
> would have expected a bloat-o-meter output.

> 
> My 2 cents is that usually it may be better to understand why it is
> not inlined and address that (e.g., likely() hints or something else)
> instead of blindly putting __always_inline. The __always_inline might
> stay there for no reason after some code changes and therefore become
> a maintenance burden. Concretely, in this case, where there is a single
> caller, one can expect the compiler to really prefer to inline the
> callees.

>
> Agreed, although the compiler is sometimes hard to convince to do the 
> right thing when dealing with rather large+complicated code in my 
> experience.

Question 1: Will this patch increase the vmlinux size?
Reply:
	Actually, the overall vmlinux size becomes smaller on x86_64:
	[root@...alhost linux_old1]# ./scripts/bloat-o-meter before.vmlinux after.vmlinux  
	add/remove: 6/0 grow/shrink: 0/1 up/down: 4569/-4747 (-178)  
	Function                                     old     new   delta  
	zap_present_ptes.constprop                     -    2696   +2696  
	zap_pte_range                                  -    1236   +1236  
	zap_pmd_range.isra                             -     589    +589  
	__pfx_zap_pte_range                            -      16     +16  
	__pfx_zap_present_ptes.constprop               -      16     +16  
	__pfx_zap_pmd_range.isra                       -      16     +16  
	unmap_page_range                            5765    1018   -4747  
	Total: Before=35379786, After=35379608, chg -0.00%  
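
As a sanity check on the table above, the net change can be recomputed from the per-symbol rows. The snippet below is only an illustrative sketch: the struct and function names are made up for this mail, and the numbers are copied from the bloat-o-meter output.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the bloat-o-meter table; old_size 0 means the symbol is new. */
struct bloat_row {
	const char *name;
	long old_size;
	long new_size;
};

/* Values copied from the bloat-o-meter output above. */
static const struct bloat_row rows[] = {
	{ "zap_present_ptes.constprop",          0, 2696 },
	{ "zap_pte_range",                       0, 1236 },
	{ "zap_pmd_range.isra",                  0,  589 },
	{ "__pfx_zap_pte_range",                 0,   16 },
	{ "__pfx_zap_present_ptes.constprop",    0,   16 },
	{ "__pfx_zap_pmd_range.isra",            0,   16 },
	{ "unmap_page_range",                 5765, 1018 },
};

/* Sum (new - old) over all rows: +4569 for the six added symbols, -4747 for unmap_page_range. */
long bloat_net_delta(void)
{
	size_t n = sizeof(rows) / sizeof(rows[0]);
	long delta = 0;

	assert(n == 7);
	for (size_t i = 0; i < n; i++)
		delta += rows[i].new_size - rows[i].old_size;
	return delta;
}
```

The result agrees with the "(-178)" in the summary line and with Before minus After in the Total line.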


Question 2: Why doesn't GCC inline these functions by default? Are there any side effects of forced inlining?
Reply:
	1) GCC's default value for the max-inline-insns-single parameter imposes a size limit on the functions it inlines automatically. However, since each of these functions has a single caller, inlining them not only improves performance but also reduces code size, as the bloat-o-meter output above shows. May we consider relaxing the max-inline-insns-single restriction in this case?

	2) The functions inlined by this patch lie on a single call path and all end up inside unmap_page_range. The only side effect is that the code generated for unmap_page_range grows, and since unmap_page_range itself is not inlined any further, the impact is well contained.
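
The single-call-path shape described above can be sketched in userspace C. This is a simplified illustration, not the kernel code: FORCE_INLINE is a stand-in defined here for the kernel's __always_inline, and zap_leaf, zap_middle and zap_range are hypothetical names that only mimic the nesting of the real functions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's __always_inline (name chosen for this sketch). */
#define FORCE_INLINE inline __attribute__((__always_inline__))

/* Innermost level of the chain: clears non-zero entries and counts them. */
static FORCE_INLINE size_t zap_leaf(unsigned char *entries, size_t n)
{
	size_t zapped = 0;

	for (size_t i = 0; i < n; i++) {
		if (entries[i]) {
			entries[i] = 0;
			zapped++;
		}
	}
	return zapped;
}

/* Middle level: walks the range in fixed-size chunks, one caller only. */
static FORCE_INLINE size_t zap_middle(unsigned char *entries, size_t n,
				      size_t stride)
{
	size_t zapped = 0;

	assert(stride > 0);
	for (size_t off = 0; off < n; off += stride) {
		size_t len = n - off < stride ? n - off : stride;

		zapped += zap_leaf(entries + off, len);
	}
	return zapped;
}

/*
 * Sole entry point. Marked noinline so that all growth from the forced
 * inlining above stays inside this one function.
 */
__attribute__((noinline)) size_t zap_range(unsigned char *entries, size_t n)
{
	return zap_middle(entries, n, 4);
}
```

With the attribute in place the compiler has to fold both helpers into zap_range regardless of its size heuristics, so only that one function grows.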



Question 3: Does this inlining modification affect code maintainability?
Reply: The functions marked __always_inline are called exclusively from unmap_page_range and form a single call path, so the change does not introduce additional maintenance complexity.


Question 4: Have you performed performance testing on other platforms? Have you tested other scenarios?
Reply:
	1) I tested the same GCC version on arm64. There, even without this patch, these functions are inlined into unmap_page_range automatically. This appears to be due to architecture-specific differences in GCC's default value for max-inline-insns-single.

	2) I believe UnixBench serves as a reasonably representative server benchmark. Theoretically, this patch should improve performance by reducing multi-layer function call overhead. However, I would sincerely appreciate your guidance on what additional tests might better demonstrate the performance improvements. Could you kindly suggest some specific benchmarks or test scenarios I should consider?
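
One candidate for a more targeted test would be a microbenchmark that times munmap() on populated anonymous memory, since tearing down such a mapping is expected to go through unmap_page_range. The sketch below is only a proposal and has not been used to produce any numbers for this patch; time_munmap_ns is a hypothetical helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Map an anonymous region, touch every page so that page table entries
 * exist, then time only the munmap() call.
 * Returns the elapsed time in nanoseconds, or -1 on failure.
 */
long long time_munmap_ns(size_t bytes)
{
	long page = sysconf(_SC_PAGESIZE);
	struct timespec t0, t1;
	unsigned char *p;

	assert(page > 0);
	p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Fault in each page so there is something to zap. */
	for (size_t off = 0; off < bytes; off += (size_t)page)
		p[off] = 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (munmap(p, bytes) != 0)
		return -1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (t1.tv_nsec - t0.tv_nsec);
}
```

Running it in a loop over a large region (for example 1 GiB) on kernels with and without the patch, and comparing the distributions, would isolate the zap path better than a whole-system score.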

--
Cheers,

Li Qiang