Message-ID: <3e9b27dd-1051-4e40-bd80-0fbbda957f0a@linux.dev>
Date: Tue, 6 Jan 2026 23:41:05 +0800
From: Lance Yang <lance.yang@...ux.dev>
To: "David Hildenbrand (Red Hat)" <david@...nel.org>
Cc: dave.hansen@...el.com, dave.hansen@...ux.intel.com, will@...nel.org,
aneesh.kumar@...nel.org, npiggin@...il.com, peterz@...radead.org,
tglx@...utronix.de, mingo@...hat.com, bp@...en8.de, x86@...nel.org,
hpa@...or.com, arnd@...db.de, akpm@...ux-foundation.org,
lorenzo.stoakes@...cle.com, ziy@...dia.com, baolin.wang@...ux.alibaba.com,
Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
dev.jain@....com, baohua@...nel.org, shy828301@...il.com, riel@...riel.com,
jannh@...gle.com, linux-arch@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, ioworker0@...il.com
Subject: Re: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to
skip redundant IPI
On 2026/1/6 23:07, David Hildenbrand (Red Hat) wrote:
> On 1/6/26 13:03, Lance Yang wrote:
>> From: Lance Yang <lance.yang@...ux.dev>
>>
>> pmdp_collapse_flush() may already send IPIs to flush TLBs, and then
>> callers send another IPI via tlb_remove_table_sync_one() or
>> pmdp_get_lockless_sync() to synchronize with concurrent GUP-fast walkers.
>>
>> However, since GUP-fast runs with IRQs disabled, the TLB flush IPI
>> already provides the necessary synchronization. We can avoid the
>> redundant second IPI.
>>
>> Introduce pmdp_collapse_flush_sync() which combines flush and sync:
>>
>> - For architectures using the generic pmdp_collapse_flush() implementation
>>   (e.g., x86): Use mmu_gather to track IPI sends. If the TLB flush sent
>>   an IPI, tlb_gather_remove_table_sync_one() will skip the redundant one.
>>
>> - For architectures with custom pmdp_collapse_flush() (s390, riscv,
>>   powerpc): Fall back to calling pmdp_collapse_flush() followed by
>>   tlb_remove_table_sync_one(). No behavior change.
>>
>> Update khugepaged to use pmdp_collapse_flush_sync() instead of separate
>> flush and sync calls. Remove the now-unused pmdp_get_lockless_sync()
>> macro.
>>
>> Suggested-by: David Hildenbrand (Red Hat) <david@...nel.org>
>> Signed-off-by: Lance Yang <lance.yang@...ux.dev>
>> ---
>> include/linux/pgtable.h | 13 +++++++++----
>> mm/khugepaged.c | 9 +++------
>> mm/pgtable-generic.c | 34 ++++++++++++++++++++++++++++++++++
>> 3 files changed, 46 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index eb8aacba3698..69e290dab450 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -755,7 +755,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>> return pmd;
>> }
>> #define pmdp_get_lockless pmdp_get_lockless
>> -#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
>> #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>> #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>> @@ -774,9 +773,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>> {
>> return pmdp_get(pmdp);
>> }
>> -static inline void pmdp_get_lockless_sync(void)
>> -{
>> -}
>> #endif
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> @@ -1174,6 +1170,8 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>> unsigned long address, pmd_t *pmdp);
>> +extern pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
>> + unsigned long address, pmd_t *pmdp);
>> #else
>> static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>> unsigned long address,
>> @@ -1182,6 +1180,13 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>> BUILD_BUG();
>> return *pmdp;
>> }
>> +static inline pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
>> + unsigned long address,
>> + pmd_t *pmdp)
>> +{
>> + BUILD_BUG();
>> + return *pmdp;
>> +}
>> #define pmdp_collapse_flush pmdp_collapse_flush
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> #endif
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 9f790ec34400..0a98afc85c50 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1177,10 +1177,9 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>> * Parallel GUP-fast is fine since GUP-fast will back off when
>> * it detects PMD is changed.
>> */
>> - _pmd = pmdp_collapse_flush(vma, address, pmd);
>> + _pmd = pmdp_collapse_flush_sync(vma, address, pmd);
>> spin_unlock(pmd_ptl);
>> mmu_notifier_invalidate_range_end(&range);
>> - tlb_remove_table_sync_one();
>
> Now you issue the IPI under PTL.
We already send the TLB flush IPI under PTL today, e.g. in
try_collapse_pte_mapped_thp():

	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
	pmdp_get_lockless_sync();
	pte_unmap_unlock(start_pte, ptl);
But anyway, we can do better by passing the ptl in and unlocking it
before the sync IPI ;)
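
Something like this at the collapse_huge_page() call site (hypothetical
sketch only; the extra ptl parameter doesn't exist in this series yet):

	/*
	 * Hypothetical: the helper takes pmd_ptl, clears the PMD under
	 * the lock, drops it, and only then issues the sync IPI if the
	 * TLB flush didn't already send one. The caller no longer
	 * unlocks pmd_ptl itself.
	 */
	_pmd = pmdp_collapse_flush_sync(vma, address, pmd, pmd_ptl);
	mmu_notifier_invalidate_range_end(&range);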
>
> [...]
>
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index d3aec7a9926a..be2ee82e6fc4 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -233,6 +233,40 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
>> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>> return pmd;
>> }
>> +
>> +pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, unsigned long address,
>> + pmd_t *pmdp)
>> +{
>> + struct mmu_gather tlb;
>> + pmd_t pmd;
>> +
>> + VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> + VM_BUG_ON(pmd_trans_huge(*pmdp));
>> +
>> + tlb_gather_mmu(&tlb, vma->vm_mm);
>
> Should we be using the new tlb_gather_mmu_vma(), and do we have to set
> the TLB pagesize to PMD?
Yes, good point on tlb_gather_mmu_vma()!
So, the sequence will be:

	tlb_gather_mmu_vma(&tlb, vma);
	pmd = pmdp_huge_get_and_clear(...);
	flush_tlb_mm_range(..., &tlb);
	if (ptl)
		spin_unlock(ptl);
	tlb_gather_remove_table_sync_one(&tlb);
	tlb_finish_mmu(&tlb);
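
As for the TLB pagesize: I think a tlb_flush_pmd_range(&tlb, address,
HPAGE_PMD_SIZE) before the flush should mark PMD granularity on the
gather, but I need to double-check that.

And the try_collapse_pte_mapped_thp() call site quoted above would then
look something like (again with the hypothetical ptl argument):

	pgt_pmd = pmdp_collapse_flush_sync(vma, haddr, pmd, ptl);
	/* ptl was already dropped inside the helper before the sync IPI */
	pte_unmap(start_pte);

Thanks,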
Lance