Message-ID: <ab22e314-63d1-46cf-a54c-b2af8db4d97a@lucifer.local>
Date: Mon, 4 Aug 2025 14:29:11 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Li Qiang <liqiang01@...inos.cn>
Cc: akpm@...ux-foundation.org, david@...hat.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Liam.Howlett@...cle.com, vbabka@...e.cz,
rppt@...nel.org, surenb@...gle.com, mhocko@...e.com
Subject: Re: [PATCH] mm: memory: Force-inline PTE/PMD zapping functions for
performance
On Mon, Aug 04, 2025 at 08:39:23PM +0800, Li Qiang wrote:
> This change converts several critical page table zapping functions from
> `inline` to `__always_inline`, resulting in measurable performance
> improvements in process spawning workloads.
>
> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
> - UnixBench 'context1' test shows ~6% improvement (single-core)
> - UnixBench shows ~0.6% improvement (single-core)
These aren't exactly earth-shattering. Are we sure these benchmarks are
representative of real workloads?
Spawning a bazillion processes is not really meaningful.
> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
> - Net code reduction of 1745 bytes (add/remove: 211/166)
>
> The modified functions form a hot path during process teardown:
> 1. zap_present_ptes()
> 2. do_zap_pte_range()
> 3. zap_pte_range()
> 4. zap_pmd_range()
>
> Signed-off-by: Li Qiang <liqiang01@...inos.cn>
I think others have covered this well, but we've had patches like this before
where, in essence, it's a case of 'improves things on my machine'.
The question really is _why_ your compiler is not making these inline in
the first place.
I'm no compiler expert, but I believe the `inline` here is redundant anyway
within a compilation unit, so the compiler will make its own inlining
decision regardless.
These are pretty big functions though. You're essentially inlining
everything into a mega function in unmap_page_range(). Which seems iffy.
I wonder if we might see degradation in other workloads? And you're talking
about one architecture, not others...
I feel like you'd really need to justify with information on the compiler
(ideally with insights into why it's not inlining now), how it impacts
other architectures, _real workloads_ you've observed this matter for,
etc. for this to be justifiable.
Also are you sure it has to be _every_ level in the hierarchy? What happens
if you inline only e.g. zap_present_ptes(), as we do with
zap_present_folio_ptes() already?
(The fact that that's _also_ inlined makes this a mega giant chonker of an
inlined function too...).
I guess bloat is less of an issue as it's all going inside a non-inlined
function.
But how this behaves in places other than 'not entirely convincing
benchmark on one architecture/uarch' is key here I think.
I don't think I'll really be convinced until there's quite a bit more data
to back this up with real-world usage.
> ---
> mm/memory.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..281a353fae7b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1543,7 +1543,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> *
> * Returns the number of processed (skipped or zapped) PTEs (at least 1).
> */
> -static inline int zap_present_ptes(struct mmu_gather *tlb,
> +static __always_inline int zap_present_ptes(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> unsigned int max_nr, unsigned long addr,
> struct zap_details *details, int *rss, bool *force_flush,
> @@ -1662,7 +1662,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
> return nr;
> }
>
> -static inline int do_zap_pte_range(struct mmu_gather *tlb,
> +static __always_inline int do_zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pte_t *pte,
> unsigned long addr, unsigned long end,
> struct zap_details *details, int *rss,
> @@ -1698,7 +1698,7 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
> return nr;
> }
>
> -static unsigned long zap_pte_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> struct zap_details *details)
> @@ -1790,7 +1790,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> return addr;
> }
>
> -static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pud_t *pud,
> unsigned long addr, unsigned long end,
> struct zap_details *details)
> @@ -1832,7 +1832,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> return addr;
> }
>
> -static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pud_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, p4d_t *p4d,
> unsigned long addr, unsigned long end,
> struct zap_details *details)
> @@ -1861,7 +1861,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
> return addr;
> }
>
> -static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pgd_t *pgd,
> unsigned long addr, unsigned long end,
> struct zap_details *details)
> --
> 2.25.1
>