[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4c453dcc-2837-4f1a-905b-3462270f5e31@lucifer.local>
Date: Tue, 14 Oct 2025 15:21:33 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Huang Ying <ying.huang@...ux.alibaba.com>
Cc: Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
Zi Yan <ziy@...dia.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Ryan Roberts <ryan.roberts@....com>,
Yang Shi <yang@...amperecomputing.com>,
"Christoph Lameter (Ampere)" <cl@...two.org>,
Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
Anshuman Khandual <anshuman.khandual@....com>,
Yicong Yang <yangyicong@...ilicon.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>,
Kevin Brodsky <kevin.brodsky@....com>,
Yin Fengwei <fengwei_yin@...ux.alibaba.com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge
pmd
On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote:
> In the current kernel, there is spurious fault fixing support for pte,
> but not for huge pmd because no architectures need it. But in the
> next patch in the series, we will change the write protection fault
> handling logic on arm64, so that some stale huge pmd entries may
> remain in the TLB. These entries need to be flushed via the huge pmd
> spurious fault fixing mechanism.
>
> Signed-off-by: Huang Ying <ying.huang@...ux.alibaba.com>
Right now the PTE level spurious fault handling is dealt with in
handle_pte_fault() when ptep_set_access_flags() returns false.
Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and
huge_pmd_set_accessed().
1 - Why are you not adding handling to GUP?
2 - Is this the correct level of abstraction? It's really not obvious but
huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP,
non-NUMA hint huge page fault where a page table entry already exists
but we are faulting anyway (e.g. non-present or read-only writable).
You don't mention any of this in the commit message, which you need to do
and really need to explain how spurious faults can arise, why you can only
do this at the point of abstraction you do (if you are unable to put it in
actual fault handing-code), and you need to add a bunch more comments to
explain this.
Otherwise this just ends up being a lot of open-coded + confusing 'you have
to go look it up/just know' type stuff that we have too much of in mm :)
So please update commit message/comments, confirm whether this is the
correct level of abstraction, and address other comments below, thanks!
> Cc: Catalin Marinas <catalin.marinas@....com>
> Cc: Will Deacon <will@...nel.org>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: David Hildenbrand <david@...hat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> Cc: Vlastimil Babka <vbabka@...e.cz>
> Cc: Zi Yan <ziy@...dia.com>
> Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@....com>
> Cc: Yang Shi <yang@...amperecomputing.com>
> Cc: "Christoph Lameter (Ampere)" <cl@...two.org>
> Cc: Dev Jain <dev.jain@....com>
> Cc: Barry Song <baohua@...nel.org>
> Cc: Anshuman Khandual <anshuman.khandual@....com>
> Cc: Yicong Yang <yangyicong@...ilicon.com>
> Cc: Kefeng Wang <wangkefeng.wang@...wei.com>
> Cc: Kevin Brodsky <kevin.brodsky@....com>
> Cc: Yin Fengwei <fengwei_yin@...ux.alibaba.com>
> Cc: linux-arm-kernel@...ts.infradead.org
> Cc: linux-kernel@...r.kernel.org
> Cc: linux-mm@...ck.org
> ---
> include/linux/pgtable.h | 4 ++++
> mm/huge_memory.c | 22 +++++++++++++++++-----
> mm/internal.h | 4 ++--
> 3 files changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 32e8457ad535..341622ec80e4 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
> #endif
>
> +#ifndef flush_tlb_fix_spurious_fault_pmd
> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0)
> +#endif
flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to
flush_tlb_page() - why do we just do nothing in this case here?
> +
> /*
> * When walking page tables, get the address of the next boundary,
> * or the end address of the range if that comes earlier. Although no
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1b81680b4225..8533457c52b7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
> EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
> #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>
> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
> - pmd_t *pmd, bool write)
> +/* Returns whether the PMD entry is changed */
Could we have a kernel doc description here?
> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr,
It's 2025 can we use bool please :)
> + pmd_t *pmd, bool write)
> {
> + int changed;
> pmd_t _pmd;
While we're here can we rename this horrible parameter name to e.g. entry? We're
significantly altering this function anyway so it isn't much more
>
> _pmd = pmd_mkyoung(*pmd);
> if (write)
> _pmd = pmd_mkdirty(_pmd);
> - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
> - pmd, _pmd, write))
> + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
> + pmd, _pmd, write);
> + if (changed)
> update_mmu_cache_pmd(vma, addr, pmd);
We can make this simpler, e.g.:
if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
pmd, entry, write)) {
update_mmu_cache_pmd(vma, addr, pmd);
return true;
}
return false;
> +
> + return changed;
> }
>
> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf)
> if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd)))
> goto unlock;
>
> - touch_pmd(vmf->vma, vmf->address, vmf->pmd, write);
> + if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) {
> + /* See corresponding comments in handle_pte_fault(). */
What are the 'corresponding' comments? How can a reader of this code know what
they are? This isn't a very helpful comment. Also those comments might be
moved in future...
Presumably it's:
/* Skip spurious TLB flush for retried page fault */
if (vmf->flags & FAULT_FLAG_TRIED)
goto unlock;
/*
* This is needed only for protection faults but the arch code
* is not yet telling us if this is a protection fault or not.
* This still avoids useless tlb flushes for .text page faults
* with threads.
*/
if (vmf->flags & FAULT_FLAG_WRITE)
flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
vmf->pte);
So I don't see why it's so egregious to have the equivalent here, or actually
ideally to abstract the code entirely.
In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum
pgtable_level"") David introduced:
enum pgtable_level {
PGTABLE_LEVEL_PTE = 0,
PGTABLE_LEVEL_PMD,
PGTABLE_LEVEL_PUD,
PGTABLE_LEVEL_P4D,
PGTABLE_LEVEL_PGD,
};
Which allows for sensible abstraction.
> + if (vmf->flags & FAULT_FLAG_TRIED)
> + goto unlock;
> + if (vmf->flags & FAULT_FLAG_WRITE)
> + flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address,
> + vmf->pmd);
> + }
>
> unlock:
> spin_unlock(vmf->ptl);
> diff --git a/mm/internal.h b/mm/internal.h
> index 1561fc2ff5b8..8b58ab00a7cd 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs,
> */
> void touch_pud(struct vm_area_struct *vma, unsigned long addr,
> pud_t *pud, bool write);
> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
> - pmd_t *pmd, bool write);
> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr,
> + pmd_t *pmd, bool write);
>
> /*
> * Parses a string with mem suffixes into its order. Useful to parse kernel
> --
> 2.39.5
>
Powered by blists - more mailing lists