Message-ID: <a70b1979-5a91-40a2-a1ba-326aff2208ae@lucifer.local>
Date: Tue, 16 Dec 2025 10:47:21 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: "David Hildenbrand (Red Hat)" <david@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
linux-mm@...ck.org, Will Deacon <will@...nel.org>,
"Aneesh Kumar K.V" <aneesh.kumar@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Nick Piggin <npiggin@...il.com>, Peter Zijlstra <peterz@...radead.org>,
Arnd Bergmann <arnd@...db.de>, Muchun Song <muchun.song@...ux.dev>,
Oscar Salvador <osalvador@...e.de>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
Pedro Falcato <pfalcato@...e.de>, Rik van Riel <riel@...riel.com>,
Harry Yoo <harry.yoo@...cle.com>,
Laurence Oberman <loberman@...hat.com>,
Prakash Sangappa <prakash.sangappa@...cle.com>,
Nadav Amit <nadav.amit@...il.com>, stable@...r.kernel.org
Subject: Re: [PATCH v2 4/4] mm/hugetlb: fix excessive IPI broadcasts when
unsharing PMD tables using mmu_gather
On Fri, Dec 12, 2025 at 08:10:19AM +0100, David Hildenbrand (Red Hat) wrote:
> As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
> huge_pmd_unshare() vs GUP-fast race") we can end up in some situations
> where we perform so many IPI broadcasts when unsharing hugetlb PMD page
> tables that it severely regresses some workloads.
>
> In particular, when we fork()+exit(), or when we munmap() a large
> area backed by many shared PMD tables, we perform one IPI broadcast per
> unshared PMD table.
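
(For scale, rough numbers for illustration only: with 4 KiB base pages on
x86-64, each shared PMD table covers PUD_SIZE = 1 GiB, so munmap() of a
1 TiB range backed entirely by shared PMD tables previously meant on the
order of 1024 IPI broadcasts, one per table; with this series that
collapses to a single TLB flush plus at most one IPI.)
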
>
> There are two optimizations to be had:
>
> (1) When we process (unshare) multiple such PMD tables, such as during
> exit(), it is sufficient to send a single IPI broadcast (as long as
> we respect locking rules) instead of one per PMD table.
>
> Locking prevents any of these PMD tables from getting reused before
> we drop the lock.
>
> (2) When we are not the last sharer (> 2 users including us), there is
> no need to send the IPI broadcast. The shared PMD tables cannot
> become exclusive (fully unshared) before an IPI is broadcast by the
> last sharer.
>
> Concurrent GUP-fast could walk into a PMD table just before we
> unshared it. It could then succeed in grabbing a page from the
> shared page table even after munmap() etc. succeeded (and suppressed
> an IPI). But there is no difference compared to GUP-fast just
> sleeping for a while after grabbing the page and re-enabling IRQs.
>
> Most importantly, GUP-fast will never walk into page tables that are
> no longer shared, because the last sharer will issue an IPI
> broadcast.
>
> (if ever required, checking whether the PUD changed in GUP-fast
> after grabbing the page like we do in the PTE case could handle
> this)
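
To make (1) and (2) concrete for anyone skimming: the flush side boils
down to the following (condensed from the tlb_flush_unshared_tables()
hunk further down; comments shortened, nothing here beyond what the diff
itself does):

	static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
	{
		/* (1): a single TLB flush covers the whole batch of unshared tables. */
		if (tlb->unshared_tables)
			tlb_flush_mmu_tlbonly(tlb);

		/* (2): IPI broadcast only if some table became exclusive, i.e. we were its last sharer. */
		if (tlb->fully_unshared_tables) {
			tlb_remove_table_sync_one();
			tlb->fully_unshared_tables = false;
		}
	}
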
>
> So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather
> infrastructure so we can implement these optimizations and demystify the
> code at least a bit. Extend the mmu_gather infrastructure to be able to
> deal with our special hugetlb PMD table sharing implementation.
>
> We'll consolidate the handling for (full) unsharing of PMD tables in
> tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track
> in "struct mmu_gather" whether we had (full) unsharing of PMD tables.
>
> Because locking is very special (concurrent unsharing+reuse must be
> prevented), we disallow deferring flushing to tlb_finish_mmu() and instead
> require an explicit earlier call to tlb_flush_unshared_tables().
>
> From hugetlb code, we call huge_pmd_unshare_flush() where we make sure
> that the expected lock protecting us from concurrent unsharing+reuse is
> still held.
>
> Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that
> tlb_flush_unshared_tables() was properly called earlier.
>
> Document it all properly.
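
For reviewers following the callers: condensing the hugetlb_unshare_pmds()
hunk below, the resulting calling convention is roughly as follows (locking
shown for the take_locks case; cache flush, mmu notifier calls and other
details elided):

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm);
	hugetlb_vma_lock_write(vma);
	i_mmap_lock_write(vma->vm_file->f_mapping);	/* blocks concurrent unshare + reuse */
	for (address = start; address < end; address += PUD_SIZE) {
		ptep = hugetlb_walk(vma, address, sz);
		if (!ptep)
			continue;
		ptl = huge_pte_lock(h, mm, ptep);
		huge_pmd_unshare(&tlb, vma, address, ptep);
		spin_unlock(ptl);
	}
	huge_pmd_unshare_flush(&tlb, vma);	/* must happen before dropping i_mmap_rwsem */
	i_mmap_unlock_write(vma->vm_file->f_mapping);
	hugetlb_vma_unlock_write(vma);
	tlb_finish_mmu(&tlb);			/* would warn if the flush above were skipped */
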
>
> Notes about tlb_remove_table_sync_one() interaction with unsharing:
>
> There are two fairly tricky things:
>
> (1) tlb_remove_table_sync_one() is a NOP on architectures without
> CONFIG_MMU_GATHER_RCU_TABLE_FREE.
>
> Here, the assumption is that the previous TLB flush would send an
> IPI to all relevant CPUs. Careful: some architectures like x86 only
> send IPIs to all relevant CPUs when tlb->freed_tables is set.
>
> The relevant architectures should be selecting
> MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable
> kernels and it might have been problematic before this patch.
>
> Also, the arch flushing behavior (independent of IPIs) is different
> when tlb->freed_tables is set. Do we have to enlighten them to also
> take care of tlb->unshared_tables? So far we didn't care, so
> hopefully we are fine. Of course, we could be setting
> tlb->freed_tables as well, but that might then unnecessarily flush
> too much, because the semantics of tlb->freed_tables are a bit
> fuzzy.
>
> This patch changes nothing in this regard.
>
> (2) tlb_remove_table_sync_one() is not a NOP on architectures with
> CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.
>
> Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB)
> we still issue IPIs during TLB flushes and don't actually need the
> second tlb_remove_table_sync_one().
>
> This optimization can be implemented on top of this, by checking, e.g., in
> tlb_remove_table_sync_one() whether we really need IPIs. But as
> described in (1), it really must honor tlb->freed_tables then to
> send IPIs to all relevant CPUs.
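
(Not a request for this series, just to make (2) concrete for myself: such
an optimization could conceivably end up looking like the sketch below.
The predicate is invented; the existing body is the smp_call_function()
broadcast, and as said above the preceding flush would then have to honor
tlb->freed_tables.)

	/* Hypothetical sketch only -- NOT part of this patch. */
	void tlb_remove_table_sync_one(void)
	{
		/*
		 * arch_tlb_flush_broadcasts_ipis() is a made-up predicate:
		 * true iff the preceding TLB flush already IPI'd every CPU
		 * that could be walking these page tables.
		 */
		if (arch_tlb_flush_broadcasts_ipis())
			return;
		smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
	}
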
>
> Further note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a
> concern, as we are holding the i_mmap_lock the whole time, preventing
> concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed
> separately as a cleanup later.
>
> There are plenty more cleanups to be had, but they have to wait until
> this is fixed.
>
> Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
> Reported-by: "Uschakow, Stanislav" <suschako@...zon.de>
> Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
> Tested-by: Laurence Oberman <loberman@...hat.com>
> Cc: <stable@...r.kernel.org>
> Signed-off-by: David Hildenbrand (Red Hat) <david@...nel.org>
After discussion on v1 4/4, and running a git range-diff between the two, this
LGTM, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> ---
> include/asm-generic/tlb.h | 74 ++++++++++++++++++++++-
> include/linux/hugetlb.h | 19 +++---
> mm/hugetlb.c | 121 ++++++++++++++++++++++----------------
> mm/mmu_gather.c | 7 +++
> mm/mprotect.c | 2 +-
> mm/rmap.c | 25 +++++---
> 6 files changed, 179 insertions(+), 69 deletions(-)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 1fff717cae510..706416babb3d6 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -364,6 +364,20 @@ struct mmu_gather {
> unsigned int vma_huge : 1;
> unsigned int vma_pfn : 1;
>
> + /*
> + * Did we unshare (unmap) any shared page tables? For now only
> + * used for hugetlb PMD table sharing.
> + */
> + unsigned int unshared_tables : 1;
> +
> + /*
> + * Did we unshare any page tables such that they are now exclusive
> + * and could get reused+modified by the new owner? When setting this
> + * flag, "unshared_tables" will be set as well. For now only used
> + * for hugetlb PMD table sharing.
> + */
> + unsigned int fully_unshared_tables : 1;
> +
> unsigned int batch_count;
>
> #ifndef CONFIG_MMU_GATHER_NO_GATHER
> @@ -400,6 +414,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
> tlb->cleared_pmds = 0;
> tlb->cleared_puds = 0;
> tlb->cleared_p4ds = 0;
> + tlb->unshared_tables = 0;
> /*
> * Do not reset mmu_gather::vma_* fields here, we do not
> * call into tlb_start_vma() again to set them if there is an
> @@ -484,7 +499,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
> * these bits.
> */
> if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
> - tlb->cleared_puds || tlb->cleared_p4ds))
> + tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables))
> return;
>
> tlb_flush(tlb);
> @@ -773,6 +788,63 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
> }
> #endif
>
> +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
> +static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt,
> + unsigned long addr)
> +{
> + /*
> + * The caller must make sure that concurrent unsharing + exclusive
> + * reuse is impossible until tlb_flush_unshared_tables() was called.
> + */
> + VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt));
> + ptdesc_pmd_pts_dec(pt);
> +
> + /* Clearing a PUD pointing at a PMD table with PMD leaves. */
> + tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE);
> +
> + /*
> + * If the page table is now exclusively owned, we fully unshared
> + * a page table.
> + */
> + if (!ptdesc_pmd_is_shared(pt))
> + tlb->fully_unshared_tables = true;
> + tlb->unshared_tables = true;
> +}
> +
> +static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
> +{
> + /*
> + * As soon as the caller drops locks to allow for reuse of
> + * previously-shared tables, these tables could get modified and
> + * even reused outside of hugetlb context, so we have to make sure that
> + * any page table walkers (incl. TLB, GUP-fast) are aware of that
> + * change.
> + *
> + * Even if we are not fully unsharing a PMD table, we must
> + * flush the TLB for the unsharer now.
> + */
> + if (tlb->unshared_tables)
> + tlb_flush_mmu_tlbonly(tlb);
> +
> + /*
> + * Similarly, we must make sure that concurrent GUP-fast will not
> + * walk previously-shared page tables that are getting modified+reused
> + * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast.
> + *
> + * We only perform this when we are the last sharer of a page table,
> + * as the IPI will reach all CPUs: any GUP-fast.
> + *
> + * Note that on configs where tlb_remove_table_sync_one() is a NOP,
> + * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
> + * required IPIs already for us.
> + */
> + if (tlb->fully_unshared_tables) {
> + tlb_remove_table_sync_one();
> + tlb->fully_unshared_tables = false;
> + }
> +}
> +#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
> +
> #endif /* CONFIG_MMU */
>
> #endif /* _ASM_GENERIC__TLB_H */
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 03c8725efa289..63b248c6bfd47 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -240,8 +240,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> pte_t *huge_pte_offset(struct mm_struct *mm,
> unsigned long addr, unsigned long sz);
> unsigned long hugetlb_mask_last_page(struct hstate *h);
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep);
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep);
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma);
> void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> unsigned long *start, unsigned long *end);
>
> @@ -271,7 +272,7 @@ void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
> int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
> void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
> void hugetlb_vma_lock_release(struct kref *kref);
> -long hugetlb_change_protection(struct vm_area_struct *vma,
> +long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma,
> unsigned long address, unsigned long end, pgprot_t newprot,
> unsigned long cp_flags);
> void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
> @@ -300,13 +301,17 @@ static inline struct address_space *hugetlb_folio_mapping_lock_write(
> return NULL;
> }
>
> -static inline int huge_pmd_unshare(struct mm_struct *mm,
> - struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +static inline int huge_pmd_unshare(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
> {
> return 0;
> }
>
> +static inline void huge_pmd_unshare_flush(struct mmu_gather *tlb,
> + struct vm_area_struct *vma)
> +{
> +}
> +
> static inline void adjust_range_if_pmd_sharing_possible(
> struct vm_area_struct *vma,
> unsigned long *start, unsigned long *end)
> @@ -432,7 +437,7 @@ static inline void move_hugetlb_state(struct folio *old_folio,
> {
> }
>
> -static inline long hugetlb_change_protection(
> +static inline long hugetlb_change_protection(struct mmu_gather *tlb,
> struct vm_area_struct *vma, unsigned long address,
> unsigned long end, pgprot_t newprot,
> unsigned long cp_flags)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3c77cdef12a32..7fef0b94b5d1e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5096,8 +5096,9 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
> unsigned long last_addr_mask;
> pte_t *src_pte, *dst_pte;
> struct mmu_notifier_range range;
> - bool shared_pmd = false;
> + struct mmu_gather tlb;
>
> + tlb_gather_mmu(&tlb, vma->vm_mm);
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, old_addr,
> old_end);
> adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> @@ -5122,12 +5123,12 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
> if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte)))
> continue;
>
> - if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
> - shared_pmd = true;
> + if (huge_pmd_unshare(&tlb, vma, old_addr, src_pte)) {
> old_addr |= last_addr_mask;
> new_addr |= last_addr_mask;
> continue;
> }
> + tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
>
> dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
> if (!dst_pte)
> @@ -5136,13 +5137,13 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
> move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz);
> }
>
> - if (shared_pmd)
> - flush_hugetlb_tlb_range(vma, range.start, range.end);
> - else
> - flush_hugetlb_tlb_range(vma, old_end - len, old_end);
> + tlb_flush_mmu_tlbonly(&tlb);
> + huge_pmd_unshare_flush(&tlb, vma);
> +
> mmu_notifier_invalidate_range_end(&range);
> i_mmap_unlock_write(mapping);
> hugetlb_vma_unlock_write(vma);
> + tlb_finish_mmu(&tlb);
>
> return len + old_addr - old_end;
> }
> @@ -5161,7 +5162,6 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> unsigned long sz = huge_page_size(h);
> bool adjust_reservation;
> unsigned long last_addr_mask;
> - bool force_flush = false;
>
> WARN_ON(!is_vm_hugetlb_page(vma));
> BUG_ON(start & ~huge_page_mask(h));
> @@ -5184,10 +5184,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> }
>
> ptl = huge_pte_lock(h, mm, ptep);
> - if (huge_pmd_unshare(mm, vma, address, ptep)) {
> + if (huge_pmd_unshare(tlb, vma, address, ptep)) {
> spin_unlock(ptl);
> - tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
> - force_flush = true;
> address |= last_addr_mask;
> continue;
> }
> @@ -5303,14 +5301,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> }
> tlb_end_vma(tlb, vma);
>
> - /*
> - * There is nothing protecting a previously-shared page table that we
> - * unshared through huge_pmd_unshare() from getting freed after we
> - * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> - * succeeded, flush the range corresponding to the pud.
> - */
> - if (force_flush)
> - tlb_flush_mmu_tlbonly(tlb);
> + huge_pmd_unshare_flush(tlb, vma);
> }
>
> void __hugetlb_zap_begin(struct vm_area_struct *vma,
> @@ -6399,7 +6390,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> }
> #endif /* CONFIG_USERFAULTFD */
>
> -long hugetlb_change_protection(struct vm_area_struct *vma,
> +long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma,
> unsigned long address, unsigned long end,
> pgprot_t newprot, unsigned long cp_flags)
> {
> @@ -6409,7 +6400,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
> pte_t pte;
> struct hstate *h = hstate_vma(vma);
> long pages = 0, psize = huge_page_size(h);
> - bool shared_pmd = false;
> struct mmu_notifier_range range;
> unsigned long last_addr_mask;
> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> @@ -6452,7 +6442,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
> }
> }
> ptl = huge_pte_lock(h, mm, ptep);
> - if (huge_pmd_unshare(mm, vma, address, ptep)) {
> + if (huge_pmd_unshare(tlb, vma, address, ptep)) {
> /*
> * When uffd-wp is enabled on the vma, unshare
> * shouldn't happen at all. Warn about it if it
> @@ -6461,7 +6451,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
> WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
> pages++;
> spin_unlock(ptl);
> - shared_pmd = true;
> address |= last_addr_mask;
> continue;
> }
> @@ -6522,22 +6511,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
> pte = huge_pte_clear_uffd_wp(pte);
> huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
> pages++;
> + tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
> }
>
> next:
> spin_unlock(ptl);
> cond_resched();
> }
> - /*
> - * There is nothing protecting a previously-shared page table that we
> - * unshared through huge_pmd_unshare() from getting freed after we
> - * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> - * succeeded, flush the range corresponding to the pud.
> - */
> - if (shared_pmd)
> - flush_hugetlb_tlb_range(vma, range.start, range.end);
> - else
> - flush_hugetlb_tlb_range(vma, start, end);
> +
> + tlb_flush_mmu_tlbonly(tlb);
> + huge_pmd_unshare_flush(tlb, vma);
> /*
> * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
> * downgrading page table protection not changing it to point to a new
> @@ -6904,18 +6887,27 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> return pte;
> }
>
> -/*
> - * unmap huge page backed by shared pte.
> +/**
> + * huge_pmd_unshare - Unmap a pmd table if it is shared by multiple users
> + * @tlb: the current mmu_gather.
> + * @vma: the vma covering the pmd table.
> + * @addr: the address we are trying to unshare.
> + * @ptep: pointer into the (pmd) page table.
> + *
> + * Called with the page table lock held, the i_mmap_rwsem held in write mode
> + * and the hugetlb vma lock held in write mode.
> *
> - * Called with page table lock held.
> + * Note: The caller must call huge_pmd_unshare_flush() before dropping the
> + * i_mmap_rwsem.
> *
> - * returns: 1 successfully unmapped a shared pte page
> - * 0 the underlying pte page is not shared, or it is the last user
> + * Returns: 1 if it was a shared PMD table and it got unmapped, or 0 if it
> + * was not a shared PMD table.
> */
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> {
> unsigned long sz = huge_page_size(hstate_vma(vma));
> + struct mm_struct *mm = vma->vm_mm;
> pgd_t *pgd = pgd_offset(mm, addr);
> p4d_t *p4d = p4d_offset(pgd, addr);
> pud_t *pud = pud_offset(p4d, addr);
> @@ -6927,18 +6919,36 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> hugetlb_vma_assert_locked(vma);
> pud_clear(pud);
> - /*
> - * Once our caller drops the rmap lock, some other process might be
> - * using this page table as a normal, non-hugetlb page table.
> - * Wait for pending gup_fast() in other threads to finish before letting
> - * that happen.
> - */
> - tlb_remove_table_sync_one();
> - ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
> +
> + tlb_unshare_pmd_ptdesc(tlb, virt_to_ptdesc(ptep), addr);
> +
> mm_dec_nr_pmds(mm);
> return 1;
> }
>
> +/*
> + * huge_pmd_unshare_flush - Complete a sequence of huge_pmd_unshare() calls
> + * @tlb: the current mmu_gather.
> + * @vma: the vma covering the pmd table.
> + *
> + * Perform necessary TLB flushes or IPI broadcasts to synchronize PMD table
> + * unsharing with concurrent page table walkers.
> + *
> + * This function must be called after a sequence of huge_pmd_unshare()
> + * calls while still holding the i_mmap_rwsem.
> + */
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
> +{
> + /*
> + * We must synchronize page table unsharing such that nobody will
> + * try reusing a previously-shared page table while it might still
> + * be in use by previous sharers (TLB, GUP_fast).
> + */
> + i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> +
> + tlb_flush_unshared_tables(tlb);
> +}
> +
> #else /* !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
>
> pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -6947,12 +6957,16 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> return NULL;
> }
>
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> {
> return 0;
> }
>
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
> +{
> +}
> +
> void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> unsigned long *start, unsigned long *end)
> {
> @@ -7219,6 +7233,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> unsigned long sz = huge_page_size(h);
> struct mm_struct *mm = vma->vm_mm;
> struct mmu_notifier_range range;
> + struct mmu_gather tlb;
> unsigned long address;
> spinlock_t *ptl;
> pte_t *ptep;
> @@ -7229,6 +7244,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> if (start >= end)
> return;
>
> + tlb_gather_mmu(&tlb, mm);
> flush_cache_range(vma, start, end);
> /*
> * No need to call adjust_range_if_pmd_sharing_possible(), because
> @@ -7248,10 +7264,10 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> if (!ptep)
> continue;
> ptl = huge_pte_lock(h, mm, ptep);
> - huge_pmd_unshare(mm, vma, address, ptep);
> + huge_pmd_unshare(&tlb, vma, address, ptep);
> spin_unlock(ptl);
> }
> - flush_hugetlb_tlb_range(vma, start, end);
> + huge_pmd_unshare_flush(&tlb, vma);
> if (take_locks) {
> i_mmap_unlock_write(vma->vm_file->f_mapping);
> hugetlb_vma_unlock_write(vma);
> @@ -7261,6 +7277,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> * Documentation/mm/mmu_notifier.rst.
> */
> mmu_notifier_invalidate_range_end(&range);
> + tlb_finish_mmu(&tlb);
> }
>
> /*
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 247e3f9db6c7a..030a162a263ba 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -426,6 +426,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
> #endif
> tlb->vma_pfn = 0;
>
> + tlb->fully_unshared_tables = 0;
> __tlb_reset_range(tlb);
> inc_tlb_flush_pending(tlb->mm);
> }
> @@ -468,6 +469,12 @@ void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm)
> */
> void tlb_finish_mmu(struct mmu_gather *tlb)
> {
> + /*
> + * We expect an earlier huge_pmd_unshare_flush() call to sort this out,
> + * due to complicated locking requirements with page table unsharing.
> + */
> + VM_WARN_ON_ONCE(tlb->fully_unshared_tables);
> +
> /*
> * If there are parallel threads are doing PTE changes on same range
> * under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 283889e4f1cec..5c330e817129e 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -652,7 +652,7 @@ long change_protection(struct mmu_gather *tlb,
> #endif
>
> if (is_vm_hugetlb_page(vma))
> - pages = hugetlb_change_protection(vma, start, end, newprot,
> + pages = hugetlb_change_protection(tlb, vma, start, end, newprot,
> cp_flags);
> else
> pages = change_protection_range(tlb, vma, start, end, newprot,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 748f48727a162..d6799afe11147 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -76,7 +76,7 @@
> #include <linux/mm_inline.h>
> #include <linux/oom.h>
>
> -#include <asm/tlbflush.h>
> +#include <asm/tlb.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/migrate.h>
> @@ -2008,13 +2008,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * if unsuccessful.
> */
> if (!anon) {
> + struct mmu_gather tlb;
> +
> VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> if (!hugetlb_vma_trylock_write(vma))
> goto walk_abort;
> - if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> +
> + tlb_gather_mmu(&tlb, mm);
> + if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
> hugetlb_vma_unlock_write(vma);
> - flush_tlb_range(vma,
> - range.start, range.end);
> + huge_pmd_unshare_flush(&tlb, vma);
> + tlb_finish_mmu(&tlb);
> /*
> * The PMD table was unmapped,
> * consequently unmapping the folio.
> @@ -2022,6 +2026,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> goto walk_done;
> }
> hugetlb_vma_unlock_write(vma);
> + tlb_finish_mmu(&tlb);
> }
> pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> if (pte_dirty(pteval))
> @@ -2398,17 +2403,20 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> * fail if unsuccessful.
> */
> if (!anon) {
> + struct mmu_gather tlb;
> +
> VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> if (!hugetlb_vma_trylock_write(vma)) {
> page_vma_mapped_walk_done(&pvmw);
> ret = false;
> break;
> }
> - if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> - hugetlb_vma_unlock_write(vma);
> - flush_tlb_range(vma,
> - range.start, range.end);
>
> + tlb_gather_mmu(&tlb, mm);
> + if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
> + hugetlb_vma_unlock_write(vma);
> + huge_pmd_unshare_flush(&tlb, vma);
> + tlb_finish_mmu(&tlb);
> /*
> * The PMD table was unmapped,
> * consequently unmapping the folio.
> @@ -2417,6 +2425,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> break;
> }
> hugetlb_vma_unlock_write(vma);
> + tlb_finish_mmu(&tlb);
> }
> /* Nuke the hugetlb page table entry */
> pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> --
> 2.52.0
>