Message-ID: <fc3c20a9-69a2-41eb-9f22-8df262717348@linux.dev>
Date: Sat, 3 Jan 2026 16:39:06 +0800
From: Lance Yang <lance.yang@...ux.dev>
To: Dave Hansen <dave.hansen@...el.com>,
"David Hildenbrand (Red Hat)" <david@...nel.org>
Cc: will@...nel.org, aneesh.kumar@...nel.org, npiggin@...il.com,
peterz@...radead.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com, arnd@...db.de,
lorenzo.stoakes@...cle.com, ziy@...dia.com, baolin.wang@...ux.alibaba.com,
Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
dev.jain@....com, baohua@...nel.org, ioworker0@...il.com,
shy828301@...il.com, riel@...riel.com, jannh@...gle.com,
linux-arch@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, akpm@...ux-foundation.org
Subject: Re: [PATCH v2 0/3] skip redundant TLB sync IPIs
On 2026/1/3 00:41, Dave Hansen wrote:
> On 12/31/25 04:33, David Hildenbrand (Red Hat) wrote:
>> On 12/31/25 05:26, Dave Hansen wrote:
>>> On 12/29/25 06:52, Lance Yang wrote:
>>> ...
>>>> This series introduces a way for architectures to indicate their TLB
>>>> flush
>>>> already provides full synchronization, allowing the redundant IPI to be
>>>> skipped. For now, the optimization is implemented for x86 first and
>>>> applied
>>>> to all page table operations that free or unshare tables.
>>>
>>> I really don't like all the complexity here. Even on x86, there are
>>> three or more ways of deriving this. Having the pv_ops check the value
>>> of another pv op is also a bit unsettling.
>>
>> Right. What I actually meant is that we simply have a property "bool
>> flush_tlb_multi_implies_ipi_broadcast" that we only set to true from the
>> initialization code.
>>
>> Without comparing the pv_ops.
>>
>> That should reduce the complexity quite a bit IMHO.
>
> Yeah, that sounds promising.
Thanks a lot for taking the time to review!
Yeah, I simplified things to just a bool property set during init
(no pv_ops comparison at runtime) as follows:
---8<---
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 13f9cd31c8f8..a926d459e6f5 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -698,6 +698,7 @@ static __always_inline unsigned long arch_local_irq_save(void)
extern void default_banner(void);
void native_pv_lock_init(void) __init;
+void setup_pv_tlb_flush_ipi_broadcast(void) __init;
#else /* __ASSEMBLER__ */
@@ -727,6 +728,10 @@ void native_pv_lock_init(void) __init;
static inline void native_pv_lock_init(void)
{
}
+
+static inline void setup_pv_tlb_flush_ipi_broadcast(void)
+{
+}
#endif
#endif /* !CONFIG_PARAVIRT */
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3502939415ad..7c010d8bee60 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -133,6 +133,12 @@ struct pv_mmu_ops {
	void (*flush_tlb_multi)(const struct cpumask *cpus,
				const struct flush_tlb_info *info);
+	/*
+	 * Indicates whether flush_tlb_multi IPIs provide sufficient
+	 * synchronization for GUP-fast when freeing or unsharing page tables.
+	 */
+	bool flush_tlb_multi_implies_ipi_broadcast;
+
	/* Hook for intercepting the destruction of an mm_struct. */
	void (*exit_mmap)(struct mm_struct *mm);
	void (*notify_page_enc_status_changed)(unsigned long pfn, int npages,
					       bool enc);
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..f570c7b2d03e 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,10 +5,23 @@
#define tlb_flush tlb_flush
static inline void tlb_flush(struct mmu_gather *tlb);
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
#include <asm-generic/tlb.h>
#include <linux/kernel.h>
#include <vdso/bits.h>
#include <vdso/page.h>
+#include <asm/paravirt.h>
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+#ifdef CONFIG_PARAVIRT
+	return pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast;
+#else
+	return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
+#endif
+}
static inline void tlb_flush(struct mmu_gather *tlb)
{
@@ -20,7 +33,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
		end = tlb->end;
	}
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+			   tlb->freed_tables || tlb->unshared_tables);
}
static inline void invlpg(unsigned long addr)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab3e172dcc69..0a49c2d79693 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -60,6 +60,23 @@ void __init native_pv_lock_init(void)
	static_branch_enable(&virt_spin_lock_key);
}
+void __init setup_pv_tlb_flush_ipi_broadcast(void)
+{
+	/*
+	 * For native TLB flush, if we don't have INVLPGB, we use IPI-based
+	 * flushing which sends real IPIs to all CPUs. This provides
+	 * sufficient synchronization for GUP-fast.
+	 *
+	 * For paravirt (e.g., KVM, Xen, Hyper-V), hypercalls may not send
+	 * real IPIs, so we keep the default value of false. Only set it to
+	 * true when using the native flush_tlb_multi without INVLPGB.
+	 */
+	if (pv_ops.mmu.flush_tlb_multi == native_flush_tlb_multi &&
+	    !cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast = true;
+}
+
+
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;
@@ -173,6 +190,7 @@ struct paravirt_patch_template pv_ops = {
	.mmu.flush_tlb_kernel = native_flush_tlb_global,
	.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
	.mmu.flush_tlb_multi = native_flush_tlb_multi,
+	.mmu.flush_tlb_multi_implies_ipi_broadcast = false,
	.mmu.exit_mmap = paravirt_nop,
	.mmu.notify_page_enc_status_changed = paravirt_nop,
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 74aa904be6dc..3f673e686b12 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -51,6 +51,7 @@
#include <asm/olpc_ofw.h>
#include <asm/pci-direct.h>
#include <asm/prom.h>
+#include <asm/paravirt.h>
#include <asm/proto.h>
#include <asm/realmode.h>
#include <asm/thermal.h>
@@ -1257,6 +1258,7 @@ void __init setup_arch(char **cmdline_p)
	io_apic_init_mappings();
	x86_init.hyper.guest_late_init();
+	setup_pv_tlb_flush_ipi_broadcast();
	e820__reserve_resources();
	e820__register_nosave_regions(max_pfn);
---
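For reference, here's roughly how I'd expect generic code to consume the
hook (just a sketch, not part of the diff above; the wrapper name
pgtable_free_sync_gup_fast() is made up for illustration, while
tlb_remove_table_sync_one() is the existing helper that IPIs all CPUs to
synchronize with GUP-fast):

#ifndef tlb_table_flush_implies_ipi_broadcast
/* Conservative default for architectures that don't implement the hook. */
static inline bool tlb_table_flush_implies_ipi_broadcast(void)
{
	return false;
}
#endif

static inline void pgtable_free_sync_gup_fast(void)
{
	/*
	 * If the preceding TLB flush already interrupted every CPU, any
	 * concurrent GUP-fast walker has left its critical section, so
	 * the extra sync IPI would be redundant.
	 */
	if (!tlb_table_flush_implies_ipi_broadcast())
		tlb_remove_table_sync_one();
}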
>
>> But maybe you have an even better way on how to indicate support, in a
>> very simple way.
>
> Rather than having some kind of explicit support enumeration, the other
> idea I had would be to actually track the state about what needs to get
> flushed somewhere. For instance, even CPUs with INVLPGB support enabled
> still use IPIs sometimes. That makes the
> tlb_table_flush_implies_ipi_broadcast() check a bit imperfect as is
> because it will force the extra sync IPI even when INVLPGB isn't being
> used for an mm.
>
> First, we already save some semblance of support for doing different
> flushes when freeing page tables in mmu_gather->freed_tables. But the
> call sites in question here are for a single flush and don't use
> mmu_gathers.
>
> The other pretty straightforward thing to do would be to add something
> to mm->context that indicates that page tables need to be freed but
> there might still be wild gup walkers out there that need an IPI. It
> would get set when the page tables are modified and cleared at all the
> sites where IPIs are sent.
Thanks for the suggestion! The mm->context tracking idea makes a lot
of sense - it would handle those mixed INVLPGB/IPI cases much better :)
Maybe we could do that as a follow-up. I'd like to keep things simple
for now, so this series just adds a bool property to skip the redundant
TLB sync IPIs on systems without INVLPGB support.
Then we could add the mm->context (or something similar) tracking later
to handle things more precisely.
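Just to check my understanding, that follow-up might look roughly like
this (completely untested sketch; the per-mm field and the helper names
are made up, only tlb_remove_table_sync_one() is real):

/* Set when page tables were freed/unshared behind GUP-fast's back. */
static inline void mm_mark_gup_sync_pending(struct mm_struct *mm)
{
	atomic_set(&mm->context.gup_fast_sync_pending, 1);
}

/* Called from every site that just sent real IPIs to all CPUs. */
static inline void mm_clear_gup_sync_pending(struct mm_struct *mm)
{
	atomic_set(&mm->context.gup_fast_sync_pending, 0);
}

/* Send the sync IPI only if no IPI broadcast has happened since. */
static inline void mm_sync_gup_fast(struct mm_struct *mm)
{
	if (atomic_xchg(&mm->context.gup_fast_sync_pending, 0))
		tlb_remove_table_sync_one();
}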
Anyway, I'm open to going straight to the mm->context approach as well
and happy to do that instead :D
Thanks,
Lance
>
>
>>> That said, complexity can be worth it with sufficient demonstrated
>>> gains. But:
>>>
>>>> When unsharing hugetlb PMD page tables or collapsing pages in
>>>> khugepaged,
>>>> we send two IPIs: one for TLB invalidation, and another to synchronize
>>>> with concurrent GUP-fast walkers.
>>>
>>> Those aren't exactly hot paths. khugepaged is fundamentally rate
>>> limited. I don't think unsharing hugetlb PMD page tables just is all
>>> that common either.
>>
>> Given that the added IPIs during unsharing broke Oracle DBs rather badly
>> [1], I think this is actually a case worth optimizing.
> ...
>> [1] https://lkml.kernel.org/r/20251223214037.580860-1-david@kernel.org
>
> Gah, that's good context, thanks.
>
> Are there any tests out there that might catch cases like this better?
> It might be something good to have 0day watch for.