Message-ID: <3d5cb9a4-6604-4302-a110-3d8ff91baa56@kernel.org>
Date: Mon, 9 Feb 2026 09:49:36 +0100
From: "David Hildenbrand (Arm)" <david@...nel.org>
To: Baolin Wang <baolin.wang@...ux.alibaba.com>, akpm@...ux-foundation.org,
catalin.marinas@....com, will@...nel.org
Cc: lorenzo.stoakes@...cle.com, ryan.roberts@....com,
Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com,
mhocko@...e.com, riel@...riel.com, harry.yoo@...cle.com, jannh@...gle.com,
willy@...radead.org, baohua@...nel.org, dev.jain@....com,
linux-mm@...ck.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 1/5] mm: rmap: support batched checks of the references
for large folios
On 12/26/25 07:07, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation. And it will be overridden by the architecture
> that implements a more efficient batch operation in the following patches.
>
> While we are at it, rename ptep_clear_flush_young_notify() to
> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
>
> Reviewed-by: Ryan Roberts <ryan.roberts@....com>
> Signed-off-by: Baolin Wang <baolin.wang@...ux.alibaba.com>
> ---
> include/linux/mmu_notifier.h | 9 +++++----
> include/linux/pgtable.h | 31 +++++++++++++++++++++++++++++++
> mm/rmap.c | 31 ++++++++++++++++++++++++++++---
> 3 files changed, 64 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d1094c2d5fb6..07a2bbaf86e9 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
> range->owner = owner;
> }
>
> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
> ({ \
> int __young; \
> struct vm_area_struct *___vma = __vma; \
> unsigned long ___address = __address; \
> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
> + unsigned int ___nr = __nr; \
> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
> ___address, \
> ___address + \
> - PAGE_SIZE); \
> + ___nr * PAGE_SIZE); \
> __young; \
> })
>
Man, that's ugly. Not your fault, but can this possibly be turned into an
inline function in a follow-up patch?
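An untested sketch of what such an inline wrapper might look like, assuming
the header dependencies in mmu_notifier.h permit it:

static inline int clear_flush_young_ptes_notify(struct vm_area_struct *vma,
		unsigned long address, pte_t *ptep, unsigned int nr)
{
	int young;

	/* Batched young-bit clear + TLB flush for nr PTEs ... */
	young = clear_flush_young_ptes(vma, address, ptep, nr);
	/* ... then notify secondary MMUs about the whole range. */
	young |= mmu_notifier_clear_flush_young(vma->vm_mm, address,
						address + nr * PAGE_SIZE);
	return young;
}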
[...]
>
> +#ifndef clear_flush_young_ptes
> +/**
> + * clear_flush_young_ptes - Clear the access bit and perform a TLB flush for PTEs
> + * that map consecutive pages of the same folio.
With the clear_young_dirty_ptes() description in mind, this should probably
be "Mark PTEs that map consecutive pages of the same folio as old and
flush the TLB"?
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear access bit.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_clear_flush_young().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock. The PTEs map consecutive
> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
Please use two-tab alignment on the second and subsequent lines, like all
similar functions here.
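I.e., something like:

static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
		unsigned long addr, pte_t *ptep, unsigned int nr)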
> +{
> + int i, young = 0;
> +
> + for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
> + young |= ptep_clear_flush_young(vma, addr, ptep);
> +
Why don't we use a loop similar to the one in clear_young_dirty_ptes(),
clear_full_ptes(), etc.? It's not only consistent but also optimizes out
the initial check of nr (nr is always >= 1 here).
	for (;;) {
		young |= ptep_clear_flush_young(vma, addr, ptep);
		if (--nr == 0)
			break;
		ptep++;
		addr += PAGE_SIZE;
	}
> + return young;
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e805ddc5a27b..985ab0b085ba 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -828,9 +828,11 @@ static bool folio_referenced_one(struct folio *folio,
> struct folio_referenced_arg *pra = arg;
> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> int ptes = 0, referenced = 0;
> + unsigned int nr;
>
> while (page_vma_mapped_walk(&pvmw)) {
> address = pvmw.address;
> + nr = 1;
>
> if (vma->vm_flags & VM_LOCKED) {
> ptes++;
> @@ -875,9 +877,24 @@ static bool folio_referenced_one(struct folio *folio,
> if (lru_gen_look_around(&pvmw))
> referenced++;
> } else if (pvmw.pte) {
> - if (ptep_clear_flush_young_notify(vma, address,
> - pvmw.pte))
> + if (folio_test_large(folio)) {
> + unsigned long end_addr =
> + pmd_addr_end(address, vma->vm_end);
> + unsigned int max_nr =
> + (end_addr - address) >> PAGE_SHIFT;
Good news: each of these can fit on a single line, as we are allowed to
exceed 80 characters if it aids readability.
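I.e., simply:

	unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
	unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;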
> + pte_t pteval = ptep_get(pvmw.pte);
> +
> + nr = folio_pte_batch(folio, pvmw.pte,
> + pteval, max_nr);
> + }
> +
> + ptes += nr;
I'm not sure whether we should mess with the "ptes" variable, which so far
is only used for VM_LOCKED vmas. See below; maybe we can just avoid that.
> + if (clear_flush_young_ptes_notify(vma, address,
> + pvmw.pte, nr))
You could maybe fit that into a single line as well; whatever you prefer.
> referenced++;
> + /* Skip the batched PTEs */
> + pvmw.pte += nr - 1;
> + pvmw.address += (nr - 1) * PAGE_SIZE;
> } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> if (pmdp_clear_flush_young_notify(vma, address,
> pvmw.pmd))
> @@ -887,7 +904,15 @@ static bool folio_referenced_one(struct folio *folio,
> WARN_ON_ONCE(1);
> }
>
> - pra->mapcount--;
> + pra->mapcount -= nr;
> + /*
> + * If we are sure that we batched the entire folio,
> + * we can just optimize and stop right here.
> + */
> + if (ptes == pvmw.nr_pages) {
> + page_vma_mapped_walk_done(&pvmw);
> + break;
> + }
Why not check for !pra->mapcount? Then you can also drop the comment,
because it's exactly the same thing we check after the loop to decide what
to return to the caller. And you would not have to mess with the "ptes"
variable at all.
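I.e., untested, roughly:

	pra->mapcount -= nr;
	if (!pra->mapcount) {
		page_vma_mapped_walk_done(&pvmw);
		break;
	}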
Only minor stuff.
--
Cheers,
David