[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20250528-x86-fix-vmap-pt-del-flush-v1-1-e250899ed160@google.com>
Date: Wed, 28 May 2025 22:30:27 +0200
From: Jann Horn <jannh@...gle.com>
To: Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>, Peter Zijlstra <peterz@...radead.org>
Cc: Rik van Riel <riel@...riel.com>, Toshi Kani <toshi.kani@....com>,
linux-kernel@...r.kernel.org, stable@...r.kernel.org,
Jann Horn <jannh@...gle.com>
Subject: [PATCH] x86/mm: Fix paging-structure cache flush on kernel table
freeing
There are several issues in pud_free_pmd_page() and pmd_free_pte_page(),
which are used by vmalloc:
1. flush_tlb_kernel_range() does not flush paging-structure caches for
inactive PCIDs, but we're using it when freeing kernel page tables.
2. flush_tlb_kernel_range() only flushes paging-structure cache entries
that intersect with the flush range, and pud_free_pmd_page() passes in
the first PAGE_SIZE bytes of the area covered by the PMD table; but
pud_free_pmd_page() is actually designed to also remove PTE tables that
were installed in the PMD table, so we need to also flush those.
3. [theoretical issue] invlpgb_kernel_range_flush() expects an exclusive
range, it does: `nr = (info->end - addr) >> PAGE_SHIFT;`
But pmd_free_pte_page() and pud_free_pmd_page() provide inclusive
ranges (`addr, addr + PAGE_SIZE-1`).
To fix it:
1. Add a new helper flush_tlb_kernel_pgtable_range() for flushing paging
structure caches for kernel page tables, and use that.
2. Flush PUD_SIZE instead of PAGE_SIZE in pud_free_pmd_page().
3. Remove `-1` in end address calculations.
Note that since I'm not touching invlpgb_kernel_range_flush() or
invlpgb_flush_addr_nosync() in this patch, the flush on PMD table deletion
with INVLPGB might be a bit slow due to using PTE_STRIDE instead of
PMD_STRIDE. I don't think that matters.
Backport notes:
Kernels before 6.16 don't have invlpgb_kernel_range_flush(); for them,
kernel_tlb_flush_pgtable() should just unconditionally call
flush_tlb_all().
I am marking this as fixing commit 28ee90fe6048 ("x86/mm: implement free
pmd/pte page interfaces"); that is technically incorrect, since back then
no paging-structure cache invalidations were performed at all, but probably
makes the most sense here.
Cc: stable@...r.kernel.org
Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Jann Horn <jannh@...gle.com>
---
arch/x86/include/asm/tlb.h | 4 ++++
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/mm/pgtable.c | 13 +++++++++----
arch/x86/mm/tlb.c | 37 +++++++++++++++++++++++++++++++++++++
4 files changed, 51 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..56a4b93a0742 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -153,6 +153,10 @@ static inline void invlpgb_flush_all(void)
/* Flush addr, including globals, for all PCIDs. */
static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
{
+ /*
+ * Don't set INVLPGB_FLAG_FINAL_ONLY here without adjusting
+ * kernel_tlb_flush_pgtable().
+ */
__invlpgb(0, 0, addr, nr, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e9b81876ebe4..06a4a2b7a9f5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -318,6 +318,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables);
extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+void flush_tlb_kernel_pgtable_range(unsigned long start, unsigned long end);
static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
{
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 62777ba4de1a..0779ed02226c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -745,8 +745,13 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
pud_clear(pud);
- /* INVLPG to clear all paging-structure caches */
- flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
+ /*
+ * Clear paging-structure caches.
+ * Note that since this function can remove a PMD table together with the
+ * PTE tables it points to, we can't just flush the first PAGE_SIZE, we
+ * must flush PUD_SIZE!
+ */
+ flush_tlb_kernel_pgtable_range(addr, addr + PUD_SIZE);
for (i = 0; i < PTRS_PER_PMD; i++) {
if (!pmd_none(pmd_sv[i])) {
@@ -778,8 +783,8 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
pte = (pte_t *)pmd_page_vaddr(*pmd);
pmd_clear(pmd);
- /* INVLPG to clear all paging-structure caches */
- flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
+ /* clear paging-structure caches */
+ flush_tlb_kernel_pgtable_range(addr, addr + PAGE_SIZE);
free_page((unsigned long)pte);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..26b78451a7ed 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1525,6 +1525,22 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
on_each_cpu(do_kernel_range_flush, info, 1);
}
+static void kernel_tlb_flush_pgtable(struct flush_tlb_info *info)
+{
+ /*
+ * The special thing about removing kernel page tables is that we can't
+ * use a flush that just removes global TLB entries; we need one that
+ * flushes paging structure caches across all PCIDs.
+ * With INVLPGB, that's explicitly supported.
+ * Otherwise, there's no good way (INVLPG and INVPCID can't specifically
+ * target one address across all PCIDs), just throw everything away.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ invlpgb_kernel_range_flush(info);
+ else
+ flush_tlb_all();
+}
+
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
struct flush_tlb_info *info;
@@ -1542,6 +1558,27 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
put_flush_tlb_info();
}
+/*
+ * Flush paging-structure cache entries for page tables covering the specified
+ * range and synchronize with concurrent hardware page table walks, in
+ * preparation for freeing page tables in the region.
+ * This must not be used for flushing translations to data pages.
+ */
+void flush_tlb_kernel_pgtable_range(unsigned long start, unsigned long end)
+{
+ struct flush_tlb_info *info;
+
+ /* We don't synchronize against GUP-fast. */
+ VM_WARN_ON(start < TASK_SIZE_MAX);
+
+ guard(preempt)();
+
+ info = get_flush_tlb_info(NULL, start, end, PMD_SHIFT, true,
+ TLB_GENERATION_INVALID);
+ kernel_tlb_flush_pgtable(info);
+ put_flush_tlb_info();
+}
+
/*
* This can be used from process context to figure out what the value of
* CR3 is without needing to do a (slow) __read_cr3().
---
base-commit: b1456f6dc167f7f101746e495bede2bac3d0e19f
change-id: 20250528-x86-fix-vmap-pt-del-flush-f9fa4f287c93
--
Jann Horn <jannh@...gle.com>
Powered by blists - more mailing lists