Message-Id: <20250528-x86-fix-vmap-pt-del-flush-v1-1-e250899ed160@google.com>
Date: Wed, 28 May 2025 22:30:27 +0200
From: Jann Horn <jannh@...gle.com>
To: Dave Hansen <dave.hansen@...ux.intel.com>, 
 Andy Lutomirski <luto@...nel.org>, Peter Zijlstra <peterz@...radead.org>
Cc: Rik van Riel <riel@...riel.com>, Toshi Kani <toshi.kani@....com>, 
 linux-kernel@...r.kernel.org, stable@...r.kernel.org, 
 Jann Horn <jannh@...gle.com>
Subject: [PATCH] x86/mm: Fix paging-structure cache flush on kernel table
 freeing

There are several issues in pud_free_pmd_page() and pmd_free_pte_page(),
which are used by vmalloc:

1. flush_tlb_kernel_range() does not flush paging-structure caches for
   inactive PCIDs, but we're using it when freeing kernel page tables.
2. flush_tlb_kernel_range() only flushes paging-structure cache entries
   that intersect with the flush range, and pud_free_pmd_page() passes in
   only the first PAGE_SIZE bytes of the area covered by the PMD table;
   but pud_free_pmd_page() is designed to also remove PTE tables that were
   installed in the PMD table, so those must be flushed as well.
3. [theoretical issue] invlpgb_kernel_range_flush() expects an exclusive
   range; it computes `nr = (info->end - addr) >> PAGE_SHIFT;`.
   But pmd_free_pte_page() and pud_free_pmd_page() provide inclusive
   ranges (`addr, addr + PAGE_SIZE-1`); see the arithmetic sketch after
   this list.
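
To make issue 3 concrete, here is the end-address arithmetic, assuming 4K
pages (PAGE_SHIFT == 12; the address is made up):

	unsigned long addr = 0xffffc90000200000UL;       /* hypothetical vmalloc address */
	unsigned long end  = addr + PAGE_SIZE - 1;       /* inclusive end, as passed now */
	unsigned long nr   = (end - addr) >> PAGE_SHIFT; /* (4096 - 1) >> 12 == 0 */

With an exclusive end (`addr + PAGE_SIZE`), `nr` comes out as 1 and covers
the page.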

To fix these issues:

1. Add a new helper flush_tlb_kernel_pgtable_range() for flushing paging
   structure caches for kernel page tables, and use that.
2. Flush PUD_SIZE instead of PAGE_SIZE in pud_free_pmd_page().
3. Remove `-1` in end address calculations.

Note that since I'm not touching invlpgb_kernel_range_flush() or
invlpgb_flush_addr_nosync() in this patch, the flush on PMD table deletion
with INVLPGB might be a bit slow due to using PTE_STRIDE instead of
PMD_STRIDE. I don't think that matters.
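
For scale (assuming 4K pages, 2M PMDs and 1G PUDs on x86-64), flushing a
PUD-sized area works out to:

	PUD_SIZE >> PAGE_SHIFT = (1UL << 30) >> 12 = 262144 4K-stride units
	PUD_SIZE >> PMD_SHIFT  = (1UL << 30) >> 21 =    512 2M-stride units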

Backport notes:
Kernels before 6.16 don't have invlpgb_kernel_range_flush(); for them,
kernel_tlb_flush_pgtable() should just unconditionally call
flush_tlb_all().
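
For such backports, a minimal sketch of that fallback (not part of this
patch) could look like:

	static void kernel_tlb_flush_pgtable(struct flush_tlb_info *info)
	{
		/*
		 * No invlpgb_kernel_range_flush() on pre-6.16 kernels; throw
		 * everything away to get rid of paging-structure cache
		 * entries across all PCIDs.
		 */
		flush_tlb_all();
	}
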
I am marking this as fixing commit 28ee90fe6048 ("x86/mm: implement free
pmd/pte page interfaces"); that is technically incorrect, since back then
no paging-structure cache invalidations were performed at all, but it
probably makes the most sense here.

Cc: stable@...r.kernel.org
Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Jann Horn <jannh@...gle.com>
---
 arch/x86/include/asm/tlb.h      |  4 ++++
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/pgtable.c           | 13 +++++++++----
 arch/x86/mm/tlb.c               | 37 +++++++++++++++++++++++++++++++++++++
 4 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..56a4b93a0742 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -153,6 +153,10 @@ static inline void invlpgb_flush_all(void)
 /* Flush addr, including globals, for all PCIDs. */
 static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 {
+	/*
+	 * Don't set INVLPGB_FLAG_FINAL_ONLY here without adjusting
+	 * kernel_tlb_flush_pgtable().
+	 */
 	__invlpgb(0, 0, addr, nr, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
 }
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e9b81876ebe4..06a4a2b7a9f5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -318,6 +318,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+void flush_tlb_kernel_pgtable_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 62777ba4de1a..0779ed02226c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -745,8 +745,13 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
 
 	pud_clear(pud);
 
-	/* INVLPG to clear all paging-structure caches */
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
+	/*
+	 * Clear paging-structure caches.
+	 * Note that since this function can remove a PMD table together with the
+	 * PTE tables it points to, we can't just flush the first PAGE_SIZE; we
+	 * must flush PUD_SIZE!
+	 */
+	flush_tlb_kernel_pgtable_range(addr, addr + PUD_SIZE);
 
 	for (i = 0; i < PTRS_PER_PMD; i++) {
 		if (!pmd_none(pmd_sv[i])) {
@@ -778,8 +783,8 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 	pte = (pte_t *)pmd_page_vaddr(*pmd);
 	pmd_clear(pmd);
 
-	/* INVLPG to clear all paging-structure caches */
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
+	/* clear paging-structure caches */
+	flush_tlb_kernel_pgtable_range(addr, addr + PAGE_SIZE);
 
 	free_page((unsigned long)pte);
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..26b78451a7ed 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1525,6 +1525,22 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
 		on_each_cpu(do_kernel_range_flush, info, 1);
 }
 
+static void kernel_tlb_flush_pgtable(struct flush_tlb_info *info)
+{
+	/*
+	 * The special thing about removing kernel page tables is that we can't
+	 * use a flush that just removes global TLB entries; we need one that
+	 * flushes paging structure caches across all PCIDs.
+	 * With INVLPGB, that's explicitly supported.
+	 * Otherwise, there's no good way (INVLPG and INVPCID can't specifically
+	 * target one address across all PCIDs), just throw everything away.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_kernel_range_flush(info);
+	else
+		flush_tlb_all();
+}
+
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
 	struct flush_tlb_info *info;
@@ -1542,6 +1558,27 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	put_flush_tlb_info();
 }
 
+/*
+ * Flush paging-structure cache entries for page tables covering the specified
+ * range and synchronize with concurrent hardware page table walks, in
+ * preparation for freeing page tables in the region.
+ * This must not be used for flushing translations to data pages.
+ */
+void flush_tlb_kernel_pgtable_range(unsigned long start, unsigned long end)
+{
+	struct flush_tlb_info *info;
+
+	/* We don't synchronize against GUP-fast. */
+	VM_WARN_ON(start < TASK_SIZE_MAX);
+
+	guard(preempt)();
+
+	info = get_flush_tlb_info(NULL, start, end, PMD_SHIFT, true,
+				  TLB_GENERATION_INVALID);
+	kernel_tlb_flush_pgtable(info);
+	put_flush_tlb_info();
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().

---
base-commit: b1456f6dc167f7f101746e495bede2bac3d0e19f
change-id: 20250528-x86-fix-vmap-pt-del-flush-f9fa4f287c93

-- 
Jann Horn <jannh@...gle.com>
