Message-ID: <20250905055103.3821518-7-baolu.lu@linux.intel.com>
Date: Fri, 5 Sep 2025 13:51:01 +0800
From: Lu Baolu <baolu.lu@...ux.intel.com>
To: Joerg Roedel <joro@...tes.org>,
Will Deacon <will@...nel.org>,
Robin Murphy <robin.murphy@....com>,
Kevin Tian <kevin.tian@...el.com>,
Jason Gunthorpe <jgg@...dia.com>,
Jann Horn <jannh@...gle.com>,
Vasant Hegde <vasant.hegde@....com>,
Dave Hansen <dave.hansen@...el.com>,
Alistair Popple <apopple@...dia.com>,
Peter Zijlstra <peterz@...radead.org>,
Uladzislau Rezki <urezki@...il.com>,
Jean-Philippe Brucker <jean-philippe@...aro.org>,
Andy Lutomirski <luto@...nel.org>,
Yi Lai <yi1.lai@...el.com>
Cc: iommu@...ts.linux.dev,
security@...nel.org,
linux-kernel@...r.kernel.org,
Dave Hansen <dave.hansen@...ux.intel.com>,
Lu Baolu <baolu.lu@...ux.intel.com>
Subject: [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables
From: Dave Hansen <dave.hansen@...ux.intel.com>
On x86 and other architectures that map the kernel's virtual address space
into the upper portion of every process's page table, the IOMMU's
paging-structure caches can become stale when the CPU page table is shared
with the IOMMU in a Shared Virtual Address (SVA) context. This occurs when
a page used for the kernel's page tables is freed and reused without the
IOMMU being notified.
While the IOMMU driver is notified of changes to user virtual address
mappings, there is no similar notification mechanism for kernel page
table changes. This can lead to data corruption or system instability
when SVA is enabled, as the IOMMU's internal caches may retain stale
entries for kernel virtual addresses.
This introduces a conditional asynchronous mechanism, enabled by
CONFIG_ASYNC_PGTABLE_FREE. When enabled, it defers the freeing of pages
that are used as page tables for kernel address mappings: such pages are
queued to a work struct instead of being freed immediately.
This deferred freeing provides a safe, sleepable context in which a future
patch can add an IOMMU-specific callback, which might be expensive on
large-scale systems. It ensures the necessary IOMMU cache invalidation is
performed outside of any critical, non-sleepable path, before the page is
finally returned to the page allocator.
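As a rough sketch of where that callback could sit, the work function
(mirroring the one added below) could flush the IOMMU's paging-structure
caches for kernel virtual addresses before handing the queued pages back.
Note that iommu_sva_invalidate_kva_range() is only a placeholder name for
the helper a later patch would introduce; it is not defined by this patch:

static void kernel_pgtable_work_func(struct work_struct *work)
{
        struct ptdesc *pt, *next;
        LIST_HEAD(page_list);

        spin_lock(&kernel_pgtable_work.lock);
        list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
        spin_unlock(&kernel_pgtable_work.lock);

        /*
         * Placeholder hook, not part of this patch: a follow-up patch
         * would invalidate the IOMMU paging-structure caches for kernel
         * virtual addresses here, in sleepable workqueue context,
         * before the pages below are reused.
         */
        iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);

        list_for_each_entry_safe(pt, next, &page_list, pt_list) {
                list_del(&pt->pt_list);
                __pagetable_free(pt);
        }
}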
In the current kernel, some page table pages are allocated with an
associated struct ptdesc, while others are not. Those without a ptdesc are
freed using free_pages() and its variants, which bypass the destructor
that pagetable_dtor_free() would run. While the long-term plan is to
convert all page table pages to use struct ptdesc, this patch uses a
temporary flag within ptdesc to indicate whether a page needs a
destructor, since the immediate goal is to fix a potential security issue
in IOMMU SVA. The flag and its associated logic can be removed once the
conversion is complete.
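Purely for illustration, such flag helpers could look like the sketch
below; the real ptdesc_test_kernel()/ptdesc_clear_kernel() helpers used in
this patch are defined earlier in the series, and the bit chosen here is
an assumption rather than the series' actual encoding:

/*
 * Illustrative sketch only. PG_owner_priv_1 is an arbitrary stand-in
 * for whatever bit the series actually reserves in the ptdesc.
 */
static inline void ptdesc_set_kernel(struct ptdesc *pt)
{
        set_bit(PG_owner_priv_1, &ptdesc_page(pt)->flags);
}

static inline bool ptdesc_test_kernel(struct ptdesc *pt)
{
        return test_bit(PG_owner_priv_1, &ptdesc_page(pt)->flags);
}

static inline void ptdesc_clear_kernel(struct ptdesc *pt)
{
        clear_bit(PG_owner_priv_1, &ptdesc_page(pt)->flags);
}

Allocation paths for kernel page tables would call ptdesc_set_kernel(), so
that pagetable_free() below can route those pages through the deferred
path while user page tables keep the existing synchronous free.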
Signed-off-by: Dave Hansen <dave.hansen@...ux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@...ux.intel.com>
---
include/linux/mm.h | 16 +++++++++++++---
mm/pgtable-generic.c | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 668d519edc0f..6d10715ecbb0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2891,6 +2891,14 @@ static inline void __pagetable_free(struct ptdesc *pt)
__free_pages(page, compound_order(page));
}
+#ifdef CONFIG_ASYNC_PGTABLE_FREE
+void pagetable_free_async(struct ptdesc *pt);
+#else
+static inline void pagetable_free_async(struct ptdesc *pt)
+{
+ __pagetable_free(pt);
+}
+#endif
/**
* pagetable_free - Free pagetables
* @pt: The page table descriptor
@@ -2900,10 +2908,12 @@ static inline void __pagetable_free(struct ptdesc *pt)
*/
static inline void pagetable_free(struct ptdesc *pt)
{
- if (ptdesc_test_kernel(pt))
+ if (ptdesc_test_kernel(pt)) {
ptdesc_clear_kernel(pt);
-
- __pagetable_free(pt);
+ pagetable_free_async(pt);
+ } else {
+ __pagetable_free(pt);
+ }
}
#if defined(CONFIG_SPLIT_PTE_PTLOCKS)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..cf884e8300ca 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -406,3 +406,42 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
pte_unmap_unlock(pte, ptl);
goto again;
}
+
+#ifdef CONFIG_ASYNC_PGTABLE_FREE
+static void kernel_pgtable_work_func(struct work_struct *work);
+
+static struct {
+ struct list_head list;
+ /* protect above ptdesc lists */
+ spinlock_t lock;
+ struct work_struct work;
+} kernel_pgtable_work = {
+ .list = LIST_HEAD_INIT(kernel_pgtable_work.list),
+ .lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock),
+ .work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
+};
+
+static void kernel_pgtable_work_func(struct work_struct *work)
+{
+ struct ptdesc *pt, *next;
+ LIST_HEAD(page_list);
+
+ spin_lock(&kernel_pgtable_work.lock);
+ list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
+ spin_unlock(&kernel_pgtable_work.lock);
+
+ list_for_each_entry_safe(pt, next, &page_list, pt_list) {
+ list_del(&pt->pt_list);
+ __pagetable_free(pt);
+ }
+}
+
+void pagetable_free_async(struct ptdesc *pt)
+{
+ spin_lock(&kernel_pgtable_work.lock);
+ list_add(&pt->pt_list, &kernel_pgtable_work.list);
+ spin_unlock(&kernel_pgtable_work.lock);
+
+ schedule_work(&kernel_pgtable_work.work);
+}
+#endif
--
2.43.0