linux-kernel - Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87ldlcpnbx.fsf@DESKTOP-5N7EMDA>
Date: Wed, 15 Oct 2025 16:43:14 +0800
From: "Huang, Ying" <ying.huang@...ux.alibaba.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,  David Hildenbrand
 <david@...hat.com>
Cc: Catalin Marinas <catalin.marinas@....com>,  Will Deacon
 <will@...nel.org>,  Andrew Morton <akpm@...ux-foundation.org>,  Vlastimil
 Babka <vbabka@...e.cz>,  Zi Yan <ziy@...dia.com>,  Baolin Wang
 <baolin.wang@...ux.alibaba.com>,  Ryan Roberts <ryan.roberts@....com>,
  Yang Shi <yang@...amperecomputing.com>,  "Christoph Lameter (Ampere)"
 <cl@...two.org>,  Dev Jain <dev.jain@....com>,  Barry Song
 <baohua@...nel.org>,  Anshuman Khandual <anshuman.khandual@....com>,
  Yicong Yang <yangyicong@...ilicon.com>,  Kefeng Wang
 <wangkefeng.wang@...wei.com>,  Kevin Brodsky <kevin.brodsky@....com>,  Yin
 Fengwei <fengwei_yin@...ux.alibaba.com>,
  linux-arm-kernel@...ts.infradead.org,  linux-kernel@...r.kernel.org,
  linux-mm@...ck.org
Subject: Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd

Hi, Lorenzo,

Thanks for comments!

Lorenzo Stoakes <lorenzo.stoakes@...cle.com> writes:

> On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote:
>> In the current kernel, there is spurious fault fixing support for pte,
>> but not for huge pmd because no architectures need it. But in the
>> next patch in the series, we will change the write protection fault
>> handling logic on arm64, so that some stale huge pmd entries may
>> remain in the TLB. These entries need to be flushed via the huge pmd
>> spurious fault fixing mechanism.
>>
>> Signed-off-by: Huang Ying <ying.huang@...ux.alibaba.com>
>
> Right now the PTE level spurious fault handling is dealt with in
> handle_pte_fault() when ptep_set_access_flags() returns false.
>
> Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and
> huge_pmd_set_accessed().
>
> 1 - Why are you not adding handling to GUP?
>
> 2 - Is this the correct level of abstraction? It's really not obvious but
>     huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP,
>     non-NUMA hint huge page fault where a page table entry already exists
>     but we are faulting anyway (e.g. non-present or read-only writable).
>
> You don't mention any of this in the commit message, which you need to do
> and really need to explain how spurious faults can arise, why you can only
> do this at the point of abstraction you do (if you are unable to put it in
> actual fault handing-code), and you need to add a bunch more comments to
> explain this.

This patch adds the spurious PMD page fault fixing based on the spurious
PTE page fault fixing.  So, I assumed that the spurious page fault
fixing has been documented already.  But you are right, nothing prevents
us from improving it further.  Let's try to do that.

The page faults may be spurious because of the racy access to the page
table.  For example, a non-populated virtual page is accessed on 2 CPUs
simultaneously, thus the page faults are triggered on both CPUs.
However, it's possible that one CPU (say CPU A) cannot find the reason
for the page fault if the other CPU (say CPU B) has changed the page
table before the PTE is checked on CPU A.  Most of the time, the
spurious page faults can be ignored safely.  However, if the page fault
is for the write access, it's possible that a stale read-only TLB entry
exists in the local CPU and needs to be flushed on some architectures.
This is called the spurious page fault fixing.

The spurious page fault fixing only makes sense during page fault
handling, so we don't need to do it for GUP.  In fact, I plan to avoid
it in all GUP paths in another followup patch.

As for where to put the spurious PMD page fault fixing code, because
it's THP related code, I thought that we should put it in huge_memory.c,
so I implemented it in huge_pmd_set_accessed().  If we follow the design
of the spurious PTE page fault fixing, we can call the unified
implementation in __handle_mm_fault() after huge_pmd_set_accessed()
reports nothing has been changed.

> Otherwise this just ends up being a lot of open-coded + confusing 'you have
> to go look it up/just know' type stuff that we have too much of in mm :)
>
> So please update commit message/comments, confirm whether this is the
> correct level of abstraction, and address other comments below, thanks!
>
>> Cc: Catalin Marinas <catalin.marinas@....com>
>> Cc: Will Deacon <will@...nel.org>
>> Cc: Andrew Morton <akpm@...ux-foundation.org>
>> Cc: David Hildenbrand <david@...hat.com>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
>> Cc: Vlastimil Babka <vbabka@...e.cz>
>> Cc: Zi Yan <ziy@...dia.com>
>> Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@....com>
>> Cc: Yang Shi <yang@...amperecomputing.com>
>> Cc: "Christoph Lameter (Ampere)" <cl@...two.org>
>> Cc: Dev Jain <dev.jain@....com>
>> Cc: Barry Song <baohua@...nel.org>
>> Cc: Anshuman Khandual <anshuman.khandual@....com>
>> Cc: Yicong Yang <yangyicong@...ilicon.com>
>> Cc: Kefeng Wang <wangkefeng.wang@...wei.com>
>> Cc: Kevin Brodsky <kevin.brodsky@....com>
>> Cc: Yin Fengwei <fengwei_yin@...ux.alibaba.com>
>> Cc: linux-arm-kernel@...ts.infradead.org
>> Cc: linux-kernel@...r.kernel.org
>> Cc: linux-mm@...ck.org
>> ---
>>  include/linux/pgtable.h |  4 ++++
>>  mm/huge_memory.c        | 22 +++++++++++++++++-----
>>  mm/internal.h           |  4 ++--
>>  3 files changed, 23 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 32e8457ad535..341622ec80e4 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>  #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
>>  #endif
>>
>> +#ifndef flush_tlb_fix_spurious_fault_pmd
>> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0)
>> +#endif
>
> flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to
> flush_tlb_page() - why do we just do nothing in this case here?

Because all architectures do nothing for the spurious PMD page fault
fixing until the [2/2] of this series.  Where, we make it necessary to
flush the local TLB for spurious PMD page fault fixing on arm64
architecture.

If we follow the design of flush_tlb_fix_spurious_fault(), we need to
change all architecture implementation to do nothing in this patch to
keep the current behavior.  I don't think that it's a good idea.  Do
you agree?

>> +
>>  /*
>>   * When walking page tables, get the address of the next boundary,
>>   * or the end address of the range if that comes earlier.  Although no
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 1b81680b4225..8533457c52b7 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
>>  EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
>>  #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>
>> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>> -	       pmd_t *pmd, bool write)
>> +/* Returns whether the PMD entry is changed */
>
> Could we have a kernel doc description here?

Sure.

>> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>
> It's 2025 can we use bool please :)

Sure.

>> +	      pmd_t *pmd, bool write)
>>  {
>> +	int changed;
>>  	pmd_t _pmd;
>
> While we're here can we rename this horrible parameter name to e.g. entry? We're
> significantly altering this function anyway so it isn't much more

Sure.

>>
>>  	_pmd = pmd_mkyoung(*pmd);
>>  	if (write)
>>  		_pmd = pmd_mkdirty(_pmd);
>> -	if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
>> -				  pmd, _pmd, write))
>> +	changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
>> +					pmd, _pmd, write);
>> +	if (changed)
>>  		update_mmu_cache_pmd(vma, addr, pmd);
>
> We can make this simpler, e.g.:
>
> 	if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
> 				  pmd, entry, write)) {
> 		update_mmu_cache_pmd(vma, addr, pmd);
> 		return true;
> 	}
>
> 	return false;

No problem.  As long as David is OK with this.

>> +
>> +	return changed;
>>  }
>>
>>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf)
>>  	if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd)))
>>  		goto unlock;
>>
>> -	touch_pmd(vmf->vma, vmf->address, vmf->pmd, write);
>> +	if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) {
>> +		/* See corresponding comments in handle_pte_fault(). */
>
> What are the 'corresponding' comments? How can a reader of this code know what
> they are? This isn't a very helpful comment. Also those comments might be
> moved in future...
>
> Presumably it's:
>
> 		/* Skip spurious TLB flush for retried page fault */
> 		if (vmf->flags & FAULT_FLAG_TRIED)
> 			goto unlock;
> 		/*
> 		 * This is needed only for protection faults but the arch code
> 		 * is not yet telling us if this is a protection fault or not.
> 		 * This still avoids useless tlb flushes for .text page faults
> 		 * with threads.
> 		 */
> 		if (vmf->flags & FAULT_FLAG_WRITE)
> 			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
> 						     vmf->pte);
>
>
> So I don't see why it's so egregious to have the equivalent here, or actually
> ideally to abstract the code entirely.
>
> In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum
> pgtable_level"") David introduced:
>
> 	enum pgtable_level {
> 		PGTABLE_LEVEL_PTE = 0,
> 		PGTABLE_LEVEL_PMD,
> 		PGTABLE_LEVEL_PUD,
> 		PGTABLE_LEVEL_P4D,
> 		PGTABLE_LEVEL_PGD,
> 	};
>
> Which allows for sensible abstraction.

Sure.  Based on your discussion with David on this, I will add a
function to do the spurious page fault fixing for the PTE and PMD fault.

>> +		if (vmf->flags & FAULT_FLAG_TRIED)
>> +			goto unlock;
>> +		if (vmf->flags & FAULT_FLAG_WRITE)
>> +			flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address,
>> +							 vmf->pmd);
>> +	}
>>
>>  unlock:
>>  	spin_unlock(vmf->ptl);
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 1561fc2ff5b8..8b58ab00a7cd 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs,
>>   */
>>  void touch_pud(struct vm_area_struct *vma, unsigned long addr,
>>  	       pud_t *pud, bool write);
>> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>> -	       pmd_t *pmd, bool write);
>> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>> +	      pmd_t *pmd, bool write);
>>
>>  /*
>>   * Parses a string with mem suffixes into its order. Useful to parse kernel
>> --
>> 2.39.5
>>

---
Best Regards,
Huang, Ying