Message-ID: <ab22e314-63d1-46cf-a54c-b2af8db4d97a@lucifer.local>
Date: Mon, 4 Aug 2025 14:29:11 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Li Qiang <liqiang01@...inos.cn>
Cc: akpm@...ux-foundation.org, david@...hat.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Liam.Howlett@...cle.com, vbabka@...e.cz,
rppt@...nel.org, surenb@...gle.com, mhocko@...e.com
Subject: Re: [PATCH] mm: memory: Force-inline PTE/PMD zapping functions for
performance
On Mon, Aug 04, 2025 at 08:39:23PM +0800, Li Qiang wrote:
> This change converts several critical page table zapping functions from
> `inline` to `__always_inline`, resulting in measurable performance
> improvements in process spawning workloads.
>
> Performance Impact (Intel Xeon Gold 6430 2.1GHz):
> - UnixBench 'context1' test shows ~6% improvement (single-core)
> - UnixBench shows ~0.6% improvement (single-core)
These aren't exactly earth-shattering. Are we sure these benchmarks are
representative of real workloads?
Spawning a bazillion processes is not really meaningful.
> - mm/memory.o size reduced by 2.49% (70190 -> 68445 bytes)
> - Net code reduction of 1745 bytes (add/remove: 211/166)
>
> The modified functions form a hot path during process teardown:
> 1. zap_present_ptes()
> 2. do_zap_pte_range()
> 3. zap_pte_range()
> 4. zap_pmd_range()
>
> Signed-off-by: Li Qiang <liqiang01@...inos.cn>
I think others have covered this well, but we've had patches like this before
where, in essence, it's a case of 'improves things on my machine'.
The question really is _why_ your compiler is not making these inline in
the first place.
I'm no compiler expert, but I believe the `inline` here is redundant anyway
within a compilation unit, so the compiler will make its own inlining
decision regardless.
These are pretty big functions though. You're essentially inlining
everything into a mega function in unmap_page_range(). Which seems iffy.
I wonder if we might see degradation in other workloads? And you're talking
about one architecture, not others...
I feel like you'd really need to justify with information on the compiler
(ideally with insights into why it's not inlining now), how it impacts
other architectures, _real workloads_ you've observed this matter for,
etc. for this to be justifiable.
Also are you sure it has to be _every_ level in the hierarchy? What happens
if you inline only e.g. zap_present_ptes(), as we do with
zap_present_folio_ptes() already?
(The fact that that's _also_ inlined makes this a mega giant chonker of an
inlined function too...).
I guess bloat is less of an issue as it's all going inside a non-inlined
function.
But how this behaves in places other than 'not entirely convincing
benchmark on one architecture/uarch' is key here I think.
I don't think I'll really be convinced until there's quite a bit more data
to back this up with real-world usage.
> ---
> mm/memory.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..281a353fae7b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1543,7 +1543,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> *
> * Returns the number of processed (skipped or zapped) PTEs (at least 1).
> */
> -static inline int zap_present_ptes(struct mmu_gather *tlb,
> +static __always_inline int zap_present_ptes(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> unsigned int max_nr, unsigned long addr,
> struct zap_details *details, int *rss, bool *force_flush,
> @@ -1662,7 +1662,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
> return nr;
> }
>
> -static inline int do_zap_pte_range(struct mmu_gather *tlb,
> +static __always_inline int do_zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pte_t *pte,
> unsigned long addr, unsigned long end,
> struct zap_details *details, int *rss,
> @@ -1698,7 +1698,7 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
> return nr;
> }
>
> -static unsigned long zap_pte_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> struct zap_details *details)
> @@ -1790,7 +1790,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> return addr;
> }
>
> -static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pud_t *pud,
> unsigned long addr, unsigned long end,
> struct zap_details *details)
> @@ -1832,7 +1832,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> return addr;
> }
>
> -static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_pud_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, p4d_t *p4d,
> unsigned long addr, unsigned long end,
> struct zap_details *details)
> @@ -1861,7 +1861,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
> return addr;
> }
>
> -static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
> +static __always_inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pgd_t *pgd,
> unsigned long addr, unsigned long end,
> struct zap_details *details)
> --
> 2.25.1
>