Message-ID: <e049c100-2e54-4fd7-aadd-c181f9626f14@linux.intel.com>
Date: Tue, 15 Jul 2025 13:55:01 +0800
From: Baolu Lu <baolu.lu@...ux.intel.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
Robin Murphy <robin.murphy@....com>, Kevin Tian <kevin.tian@...el.com>,
Jason Gunthorpe <jgg@...dia.com>, Jann Horn <jannh@...gle.com>,
Vasant Hegde <vasant.hegde@....com>, Dave Hansen <dave.hansen@...el.com>,
Alistair Popple <apopple@...dia.com>, Uladzislau Rezki <urezki@...il.com>,
Jean-Philippe Brucker <jean-philippe@...aro.org>,
Andy Lutomirski <luto@...nel.org>, "Tested-by : Yi Lai" <yi1.lai@...el.com>,
iommu@...ts.linux.dev, security@...nel.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB
flush
On 7/11/25 16:32, Peter Zijlstra wrote:
> On Fri, Jul 11, 2025 at 11:00:06AM +0800, Baolu Lu wrote:
>> Hi Peter Z,
>>
>> On 7/10/25 21:54, Peter Zijlstra wrote:
>>> On Wed, Jul 09, 2025 at 02:28:00PM +0800, Lu Baolu wrote:
>>>> The vmalloc() and vfree() functions manage virtually contiguous, but not
>>>> necessarily physically contiguous, kernel memory regions. When vfree()
>>>> unmaps such a region, it tears down the associated kernel page table
>>>> entries and frees the physical pages.
>>>>
>>>> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
>>>> shares and walks the CPU's page tables. Architectures like x86 share
>>>> static kernel address mappings across all user page tables, allowing the
>>>> IOMMU to access the kernel portion of these tables.
>>>>
>>>> Modern IOMMUs often cache page table entries to optimize walk performance,
>>>> even for intermediate page table levels. If kernel page table mappings are
>>>> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
>>>> entries, a use-after-free (UAF) condition arises. If these
>>>> freed page table pages are reallocated for a different purpose, potentially
>>>> by an attacker, the IOMMU could misinterpret the new data as valid page
>>>> table entries. This allows the IOMMU to walk into attacker-controlled
>>>> memory, leading to arbitrary physical memory DMA access or privilege
>>>> escalation.
>>>>
>>>> To mitigate this, introduce a new iommu interface to flush IOMMU caches
>>>> and fence pending page table walks when kernel page mappings are updated.
>>>> This interface should be invoked from architecture-specific code that
>>>> manages combined user and kernel page tables.
>>>
>>> I must say I liked the kPTI based idea better. Having to iterate and
>>> invalidate an unspecified number of IOMMUs from non-preemptible context
>>> seems 'unfortunate'.
>>
>> The cache invalidation path in IOMMU drivers is already critical and
>> operates within a non-preemptible context. This approach has, in fact,
>> been used for user-space page table updates since the beginning of SVA
>> support.
>
> OK, fair enough I suppose. What kind of delays are we talking about
> here? The fact that you basically have an unbounded list of IOMMUs
> (although in practice I suppose it is limited by the number of GPUs and
> other fancy stuff you can stick in your machine) does slightly worry me.
Yes, the mm list (the struct mm of every process bound to a device) is
unbounded and can theoretically grow indefinitely, which makes the
critical region unpredictable.

I am considering whether this could be relaxed a bit by maintaining a
list of the IOMMU devices used for SVA instead. The number of IOMMU
hardware units in a system is small and bounded, which would make the
critical region deterministic.
If that's reasonable, we can do it like this (compiled but not tested):
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..9ed3be2ffaeb 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -9,6 +9,14 @@
#include "iommu-priv.h"
+struct sva_iommu_device_item {
+ struct iommu_device *iommu;
+ unsigned int users;
+ struct list_head node;
+};
+
+static LIST_HEAD(sva_iommu_device_list);
+static DEFINE_SPINLOCK(sva_iommu_device_lock);
static DEFINE_MUTEX(iommu_sva_lock);
static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
struct mm_struct *mm);
@@ -52,6 +60,71 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
return iommu_mm;
}
+static int iommu_sva_add_iommu_device(struct device *dev)
+{
+ struct iommu_device *iommu = dev->iommu->iommu_dev;
+ struct sva_iommu_device_item *iter;
+
+ struct sva_iommu_device_item *new __free(kfree) =
+ kzalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+ new->iommu = iommu;
+ new->users = 1;
+
+ guard(spinlock_irqsave)(&sva_iommu_device_lock);
+ list_for_each_entry(iter, &sva_iommu_device_list, node) {
+ if (iter->iommu == iommu) {
+ iter->users++;
+ return 0;
+ }
+ }
+ list_add(&no_free_ptr(new)->node, &sva_iommu_device_list);
+
+ return 0;
+}
+
+static void iommu_sva_remove_iommu_device(struct device *dev)
+{
+ struct iommu_device *iommu = dev->iommu->iommu_dev;
+ struct sva_iommu_device_item *iter, *tmp;
+
+ guard(spinlock_irqsave)(&sva_iommu_device_lock);
+ list_for_each_entry_safe(iter, tmp, &sva_iommu_device_list, node) {
+ if (iter->iommu != iommu)
+ continue;
+
+ if (--iter->users == 0) {
+ list_del(&iter->node);
+ kfree(iter);
+ }
+ break;
+ }
+}
+
+static int iommu_sva_attach_device(struct iommu_domain *domain, struct device *dev,
+ ioasid_t pasid, struct iommu_attach_handle *handle)
+{
+ int ret;
+
+ ret = iommu_sva_add_iommu_device(dev);
+ if (ret)
+ return ret;
+
+ ret = iommu_attach_device_pasid(domain, dev, pasid, handle);
+ if (ret)
+ iommu_sva_remove_iommu_device(dev);
+
+ return ret;
+}
+
+static void iommu_sva_detach_device(struct iommu_domain *domain,
+ struct device *dev, ioasid_t pasid)
+{
+ iommu_detach_device_pasid(domain, dev, pasid);
+ iommu_sva_remove_iommu_device(dev);
+}
+
/**
* iommu_sva_bind_device() - Bind a process address space to a device
* @dev: the device
@@ -112,8 +185,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
/* Search for an existing domain. */
list_for_each_entry(domain, &mm->iommu_mm->sva_domains, next) {
- ret = iommu_attach_device_pasid(domain, dev, iommu_mm->pasid,
- &handle->handle);
+		ret = iommu_sva_attach_device(domain, dev, iommu_mm->pasid, &handle->handle);
if (!ret) {
domain->users++;
goto out;
@@ -127,8 +199,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
goto out_free_handle;
}
- ret = iommu_attach_device_pasid(domain, dev, iommu_mm->pasid,
- &handle->handle);
+	ret = iommu_sva_attach_device(domain, dev, iommu_mm->pasid, &handle->handle);
if (ret)
goto out_free_domain;
domain->users = 1;
@@ -170,7 +241,7 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
return;
}
- iommu_detach_device_pasid(domain, dev, iommu_mm->pasid);
+ iommu_sva_detach_device(domain, dev, iommu_mm->pasid);
if (--domain->users == 0) {
list_del(&domain->next);
iommu_domain_free(domain);
@@ -312,3 +383,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
return domain;
}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+ struct sva_iommu_device_item *item;
+
+ guard(spinlock_irqsave)(&sva_iommu_device_lock);
+ list_for_each_entry(item, &sva_iommu_device_list, node) {
+ if (!item->iommu->ops->paging_cache_invalidate)
+ continue;
+ item->iommu->ops->paging_cache_invalidate(item->iommu, start, end);
+ }
+}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 156732807994..f3716200cc09 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -595,6 +595,8 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size,
* - IOMMU_DOMAIN_IDENTITY: must use an identity domain
* - IOMMU_DOMAIN_DMA: must use a dma domain
* - 0: use the default setting
+ * @paging_cache_invalidate: Invalidate paging structure caches that store
+ * intermediate levels of the page table.
* @default_domain_ops: the default ops for domains
 * @viommu_alloc: Allocate an iommufd_viommu on a physical IOMMU instance behind
 *                the @dev, as the set of virtualization resources shared/passed
@@ -654,6 +656,9 @@ struct iommu_ops {
int (*def_domain_type)(struct device *dev);
+ void (*paging_cache_invalidate)(struct iommu_device *dev,
+ unsigned long start, unsigned long end);
+
struct iommufd_viommu *(*viommu_alloc)(
struct device *dev, struct iommu_domain *parent_domain,
struct iommufd_ctx *ictx, unsigned int viommu_type);
@@ -1571,6 +1576,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
struct mm_struct *mm);
void iommu_sva_unbind_device(struct iommu_sva *handle);
u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
#else
static inline struct iommu_sva *
iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1595,6 +1601,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
}
static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
#endif /* CONFIG_IOMMU_SVA */
#ifdef CONFIG_IOMMU_IOPF
--
2.43.0
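
For completeness, here is a rough sketch (illustration only, not part of
the diff above) of how the two ends of this interface are expected to
meet: architecture code that manages combined user/kernel page tables
calls iommu_sva_invalidate_kva_range() when it tears down kernel
mappings, and an SVA-capable IOMMU driver supplies the proposed
->paging_cache_invalidate() callback. The hook point and the driver
names below are assumptions, only meant to show the intended flow:

/*
 * Illustrative call site (assumed hook point): after the CPU TLB has
 * been flushed for a kernel virtual range that was just unmapped,
 * also fence the IOMMU paging-structure caches for that range.
 */
void arch_flush_kernel_range(unsigned long start, unsigned long end)
{
	/* ... existing CPU TLB invalidation for [start, end) ... */
	iommu_sva_invalidate_kva_range(start, end);
}

/*
 * Illustrative driver callback (hypothetical driver): issue the
 * hardware-specific invalidation of paging-structure caches on this
 * IOMMU unit for the given range.
 */
static void my_iommu_paging_cache_invalidate(struct iommu_device *iommu,
					     unsigned long start,
					     unsigned long end)
{
	/* hardware-specific invalidation of cached intermediate entries */
}

static const struct iommu_ops my_iommu_ops = {
	/* ... */
	.paging_cache_invalidate = my_iommu_paging_cache_invalidate,
	/* ... */
};

With the per-IOMMU list above, the cost of that call is bounded by the
number of IOMMU units in the system rather than by the number of mms
bound to devices.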
> At some point the low latency folks are going to come hunting you down.
> Do you have a plan on how to deal with this; or are we throwing up our
> hands and say, the hardware sucks, deal with it?
>
Thanks,
baolu