Message-ID: <fc3c20a9-69a2-41eb-9f22-8df262717348@linux.dev>
Date: Sat, 3 Jan 2026 16:39:06 +0800
From: Lance Yang <lance.yang@...ux.dev>
To: Dave Hansen <dave.hansen@...el.com>,
"David Hildenbrand (Red Hat)" <david@...nel.org>
Cc: will@...nel.org, aneesh.kumar@...nel.org, npiggin@...il.com,
peterz@...radead.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com, arnd@...db.de,
lorenzo.stoakes@...cle.com, ziy@...dia.com, baolin.wang@...ux.alibaba.com,
Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
dev.jain@....com, baohua@...nel.org, ioworker0@...il.com,
shy828301@...il.com, riel@...riel.com, jannh@...gle.com,
linux-arch@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, akpm@...ux-foundation.org
Subject: Re: [PATCH v2 0/3] skip redundant TLB sync IPIs
On 2026/1/3 00:41, Dave Hansen wrote:
> On 12/31/25 04:33, David Hildenbrand (Red Hat) wrote:
>> On 12/31/25 05:26, Dave Hansen wrote:
>>> On 12/29/25 06:52, Lance Yang wrote:
>>> ...
>>>> This series introduces a way for architectures to indicate their TLB
>>>> flush
>>>> already provides full synchronization, allowing the redundant IPI to be
>>>> skipped. For now, the optimization is implemented for x86 first and
>>>> applied
>>>> to all page table operations that free or unshare tables.
>>>
>>> I really don't like all the complexity here. Even on x86, there are
>>> three or more ways of deriving this. Having the pv_ops check the value
>>> of another pv op is also a bit unsettling.
>>
>> Right. What I actually meant is that we simply have a property "bool
>> flush_tlb_multi_implies_ipi_broadcast" that we only set to true from the
>> initialization code.
>>
>> Without comparing the pv_ops.
>>
>> That should reduce the complexity quite a bit IMHO.
>
> Yeah, that sounds promising.
Thanks a lot for taking the time to review!
Yeah, I simplified things to just a bool property set during init
(no pv_ops comparison at runtime) as follows:
---8<---
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 13f9cd31c8f8..a926d459e6f5 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -698,6 +698,7 @@ static __always_inline unsigned long arch_local_irq_save(void)
extern void default_banner(void);
void native_pv_lock_init(void) __init;
+void setup_pv_tlb_flush_ipi_broadcast(void) __init;
#else /* __ASSEMBLER__ */
@@ -727,6 +728,10 @@ void native_pv_lock_init(void) __init;
static inline void native_pv_lock_init(void)
{
}
+
+static inline void setup_pv_tlb_flush_ipi_broadcast(void)
+{
+}
#endif
#endif /* !CONFIG_PARAVIRT */
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3502939415ad..7c010d8bee60 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -133,6 +133,12 @@ struct pv_mmu_ops {
	void (*flush_tlb_multi)(const struct cpumask *cpus,
				const struct flush_tlb_info *info);
+	/*
+	 * Indicates whether flush_tlb_multi IPIs provide sufficient
+	 * synchronization for GUP-fast when freeing or unsharing page tables.
+	 */
+	bool flush_tlb_multi_implies_ipi_broadcast;
+
	/* Hook for intercepting the destruction of an mm_struct. */
	void (*exit_mmap)(struct mm_struct *mm);
	void (*notify_page_enc_status_changed)(unsigned long pfn, int npages,
					       bool enc);
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..f570c7b2d03e 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,10 +5,23 @@
#define tlb_flush tlb_flush
static inline void tlb_flush(struct mmu_gather *tlb);
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
#include <asm-generic/tlb.h>
#include <linux/kernel.h>
#include <vdso/bits.h>
#include <vdso/page.h>
+#include <asm/paravirt.h>
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+#ifdef CONFIG_PARAVIRT
+	return pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast;
+#else
+	return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
+#endif
+}
static inline void tlb_flush(struct mmu_gather *tlb)
{
@@ -20,7 +33,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
		end = tlb->end;
	}
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+			   tlb->freed_tables || tlb->unshared_tables);
}
static inline void invlpg(unsigned long addr)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab3e172dcc69..0a49c2d79693 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -60,6 +60,23 @@ void __init native_pv_lock_init(void)
	static_branch_enable(&virt_spin_lock_key);
}
+void __init setup_pv_tlb_flush_ipi_broadcast(void)
+{
+	/*
+	 * For native TLB flush, if we don't have INVLPGB, we use IPI-based
+	 * flushing which sends real IPIs to all CPUs. This provides
+	 * sufficient synchronization for GUP-fast.
+	 *
+	 * For paravirt (e.g., KVM, Xen, Hyper-V), hypercalls may not send
+	 * real IPIs, so we keep the default value of false. Only set it to
+	 * true when using the native flush_tlb_multi without INVLPGB.
+	 */
+	if (pv_ops.mmu.flush_tlb_multi == native_flush_tlb_multi &&
+	    !cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast = true;
+}
+
+
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;
@@ -173,6 +190,7 @@ struct paravirt_patch_template pv_ops = {
	.mmu.flush_tlb_kernel = native_flush_tlb_global,
	.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
	.mmu.flush_tlb_multi = native_flush_tlb_multi,
+	.mmu.flush_tlb_multi_implies_ipi_broadcast = false,
	.mmu.exit_mmap = paravirt_nop,
	.mmu.notify_page_enc_status_changed = paravirt_nop,
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 74aa904be6dc..3f673e686b12 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -51,6 +51,7 @@
#include <asm/olpc_ofw.h>
#include <asm/pci-direct.h>
#include <asm/prom.h>
+#include <asm/paravirt.h>
#include <asm/proto.h>
#include <asm/realmode.h>
#include <asm/thermal.h>
@@ -1257,6 +1258,7 @@ void __init setup_arch(char **cmdline_p)
	io_apic_init_mappings();
	x86_init.hyper.guest_late_init();
+	setup_pv_tlb_flush_ipi_broadcast();
	e820__reserve_resources();
	e820__register_nosave_regions(max_pfn);
---
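For reference, here's roughly how I'd expect generic code to consume the
hook (just a sketch, not part of the diff above; the wrapper name
pgtable_free_sync_gup_fast() is made up for illustration, while
tlb_remove_table_sync_one() is the existing helper that IPIs all CPUs to
synchronize with GUP-fast):

#ifndef tlb_table_flush_implies_ipi_broadcast
/* Conservative default for architectures that don't implement the hook. */
static inline bool tlb_table_flush_implies_ipi_broadcast(void)
{
	return false;
}
#endif

static inline void pgtable_free_sync_gup_fast(void)
{
	/*
	 * If the preceding TLB flush already interrupted every CPU, any
	 * concurrent GUP-fast walker has left its critical section, so
	 * the extra sync IPI would be redundant.
	 */
	if (!tlb_table_flush_implies_ipi_broadcast())
		tlb_remove_table_sync_one();
}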
>
>> But maybe you have an even better way on how to indicate support, in a
>> very simple way.
>
> Rather than having some kind of explicit support enumeration, the other
> idea I had would be to actually track the state about what needs to get
> flushed somewhere. For instance, even CPUs with INVLPGB support enabled
> still use IPIs sometimes. That makes the
> tlb_table_flush_implies_ipi_broadcast() check a bit imperfect as is
> because it will force the extra sync IPI even when INVLPGB isn't being
> used for an mm.
>
> First, we already save some semblance of support for doing different
> flushes when freeing page tables in mmu_gather->freed_tables. But the
> call sites in question here are for a single flush and don't use
> mmu_gathers.
>
> The other pretty straightforward thing to do would be to add something
> to mm->context that indicates that page tables need to be freed but
> there might still be wild gup walkers out there that need an IPI. It
> would get set when the page tables are modified and cleared at all the
> sites where IPIs are sent.
Thanks for the suggestion! The mm->context tracking idea makes a lot
of sense - it would handle those mixed INVLPGB/IPI cases much better :)
Maybe we could do that as a follow-up. I'd like to keep things simple
for now, so this series just adds a bool property to skip the redundant
TLB sync IPIs on systems without INVLPGB support.
Then we could add the mm->context (or something similar) tracking later
to handle things more precisely.
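Just to check my understanding, that follow-up might look roughly like
this (completely untested sketch; the per-mm field and the helper names
are made up, only tlb_remove_table_sync_one() is real):

/* Set when page tables were freed/unshared behind GUP-fast's back. */
static inline void mm_mark_gup_sync_pending(struct mm_struct *mm)
{
	atomic_set(&mm->context.gup_fast_sync_pending, 1);
}

/* Called from every site that just sent real IPIs to all CPUs. */
static inline void mm_clear_gup_sync_pending(struct mm_struct *mm)
{
	atomic_set(&mm->context.gup_fast_sync_pending, 0);
}

/* Send the sync IPI only if no IPI broadcast has happened since. */
static inline void mm_sync_gup_fast(struct mm_struct *mm)
{
	if (atomic_xchg(&mm->context.gup_fast_sync_pending, 0))
		tlb_remove_table_sync_one();
}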
Anyway, I'm open to going straight to the mm->context approach as well
and happy to do that instead :D
Thanks,
Lance
>
>
>>> That said, complexity can be worth it with sufficient demonstrated
>>> gains. But:
>>>
>>>> When unsharing hugetlb PMD page tables or collapsing pages in
>>>> khugepaged,
>>>> we send two IPIs: one for TLB invalidation, and another to synchronize
>>>> with concurrent GUP-fast walkers.
>>>
>>> Those aren't exactly hot paths. khugepaged is fundamentally rate
>>> limited. I don't think unsharing hugetlb PMD page tables just is all
>>> that common either.
>>
>> Given that the added IPIs during unsharing broke Oracle DBs rather badly
>> [1], I think this is actually a case worth optimizing.
> ...
>> [1] https://lkml.kernel.org/r/20251223214037.580860-1-david@kernel.org
>
> Gah, that's good context, thanks.
>
> Are there any tests out there that might catch cases like this better?
> It might be something good to have 0day watch for.