Message-ID: <20250603175403.GA407344@nvidia.com>
Date: Tue, 3 Jun 2025 14:54:03 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: jacob.pan@...ux.microsoft.com,
Saurabh Sengar <ssengar@...ux.microsoft.com>
Cc: Yi Liu <yi.l.liu@...el.com>, linux-kernel@...r.kernel.org,
"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
Alex Williamson <alex.williamson@...hat.com>,
Zhang Yu <zhangyu1@...rosoft.com>,
Easwar Hariharan <eahariha@...ux.microsoft.com>
Subject: Re: [PATCH 1/2] vfio: Fix unbalanced vfio_df_close call in no-iommu
mode
On Tue, Jun 03, 2025 at 09:25:49AM -0700, Jacob Pan wrote:
> Put this bug aside for now, I'm still unclear on why we do not allow
> bind for no-IOMMU devices. Per my understanding, no-IOMMU only means no
> translation. But since the device has still been granted access, we should
> be able to allow binding a device in no-IOMMU mode to an IOMMUFD
> context while simply disallowing IOAS attachment?
Because it isn't finished yet: the no-iommu mode under cdev should work
the same as the real iommu mode, complete with an ioas, hwpt and
idevice object.
There is a bunch of work still missing, so the uapi was blocked off
until it is done. We don't want to end up with a uapi that doesn't
implement the plan and can't be fixed later.
> The reason I am asking is that I am working on enabling cdev with
> noiommu mode based on Yi's patch
> (https://lore.kernel.org/kvm/20230601082413.22a55ac4.alex.williamson@redhat.com/),
> it seems having the iommufd implicit ownership model is key to
> enabling PCI HOT RESET.
The idea is that the IOAS and HWPT will allow iommufd to do the page
pinning exactly as normal-iommu would. So no mlock. A new ioctl would
let userspace retrieve the virtual to physical mapping, so no /proc/
mess.
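
As an illustration from the userspace side, the pinning would just be
the existing IOAS map path; IOMMU_IOAS_ALLOC and IOMMU_IOAS_MAP are
already real iommufd uapi, only their use for a no-iommu device would
be new:

/*
 * Userspace sketch: IOMMU_IOAS_MAP already pins the pages behind the
 * mapping, which is what would stand in for mlock in no-iommu mode.
 */
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int pin_buffer(int iommufd, void *buf, size_t len, __u64 iova)
{
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
			 IOMMU_IOAS_MAP_WRITEABLE,
		.user_va = (uintptr_t)buf,
		.length = len,
		.iova = iova,
	};

	if (ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc))
		return -1;
	map.ioas_id = alloc.out_ioas_id;
	/* iommufd pins the pages for as long as the mapping exists */
	return ioctl(iommufd, IOMMU_IOAS_MAP, &map);
}
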
I haven't thought too much about it, but I'm suspecting the easiest
thing is to make a "VFIO no-iommu iommu_domain" with special ops/etc
that can be put under a HWPT. The selftest has something like this
already.
Then when you bind, VFIO would request no-iommu mode and it would go
through the normal flow but create/find a no-iommu HWPT with the
special no-iommu iommu_domain.
Aside from that special logic everything else would be exactly the
same.
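
As a rough sketch of what the VFIO side could look like (the helper is
the iommufd_device_bind_noiommu() from the diff below; the vfio_device
field names here are just assumptions for illustration):

#include <linux/iommufd.h>
#include <linux/vfio.h>

/*
 * Hypothetical sketch only: bind a no-iommu device so it still gets an
 * idevice object and iommufd ownership (e.g. for PCI hot reset).
 */
static int vfio_noiommu_bind_sketch(struct vfio_device *vdev,
				    struct iommufd_ctx *ictx)
{
	struct iommufd_device *idev;
	u32 dev_id;

	idev = iommufd_device_bind_noiommu(ictx, vdev->dev, &dev_id);
	if (IS_ERR(idev))
		return PTR_ERR(idev);

	/* Remembered so unbind/close can release it again */
	vdev->iommufd_device = idev;
	return 0;
}
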
Add the virtual to physical ioctl and you are done.
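
To make that concrete, a minimal sketch of what such a lookup uapi
could look like; the struct, field names and command number below are
all invented for illustration, nothing like this exists in
include/uapi/linux/iommufd.h today:

/*
 * Hypothetical addition to include/uapi/linux/iommufd.h: translate an
 * IOVA in an IOAS to the physical address iommufd pinned behind it, so
 * no-iommu userspace can stop parsing /proc/self/pagemap.
 */
struct iommu_ioas_iova_to_phys {
	__u32 size;		/* sizeof(struct iommu_ioas_iova_to_phys) */
	__u32 ioas_id;		/* which IOAS to look in */
	__aligned_u64 iova;	/* in: IOVA (equals the VA for no-iommu) */
	__aligned_u64 out_phys;	/* out: physical address of that page */
};
/* Command number is a placeholder picked only for this sketch */
#define IOMMU_IOAS_IOVA_TO_PHYS	_IO(IOMMUFD_TYPE, 0xff)
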
> Our goal is to leverage a persistent iommufd context for kexec
> handover (KHO) usage; we currently have noiommu mode. This requires
> binding the device with an iommufd ctx (which can be marked as
> persistent) AFAIK. Any suggestions?
Then when you get here things can be much more similar to real KHO as
you could have a real page table (courtesy of the iommupt patches)
with identical KHO semantics for no-iommu as real-iommu. Then it
becomes much less of a special case.
Something approximately like this:
Jason
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 2111bad72c720f..7d171a7c34a287 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -241,6 +241,62 @@ struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
}
EXPORT_SYMBOL_NS_GPL(iommufd_device_bind, "IOMMUFD");
+struct iommufd_device *iommufd_device_bind_noiommu(struct iommufd_ctx *ictx,
+ struct device *dev, u32 *id)
+{
+ struct iommufd_group *igroup;
+ struct iommufd_device *idev;
+ int rc;
+
+ /* FIXME better code sharing with iommufd_get_group() and iommufd_device_bind() */
+
+ igroup = kzalloc(sizeof(*igroup), GFP_KERNEL);
+ if (!igroup)
+ return ERR_PTR(-ENOMEM);
+
+ kref_init(&igroup->ref);
+ mutex_init(&igroup->lock);
+ xa_init(&igroup->pasid_attach);
+ igroup->sw_msi_start = PHYS_ADDR_MAX;
+
+ // FIXME: need to check that igroup->group == NULL is OK for the no iommu paths.
+
+ rc = iommu_device_claim_dma_owner(dev, ictx);
+ if (rc && rc != -ENODEV)
+ goto out_group_put;
+
+ idev = iommufd_object_alloc(ictx, idev, IOMMUFD_OBJ_DEVICE);
+ if (IS_ERR(idev)) {
+ rc = PTR_ERR(idev);
+ goto out_release_owner;
+ }
+ idev->no_iommu = true;
+ idev->ictx = ictx;
+ iommufd_ctx_get(ictx);
+ idev->dev = dev;
+ /* The calling driver is a user until iommufd_device_unbind() */
+ refcount_inc(&idev->obj.users);
+ /* igroup refcount moves into iommufd_device */
+ idev->igroup = igroup;
+ mutex_init(&idev->iopf_lock);
+ /*
+ * If the caller fails after this success it must call
+ * iommufd_unbind_device() which is safe since we hold this refcount.
+ * This also means the device is a leaf in the graph and no other object
+ * can take a reference on it.
+ */
+ iommufd_object_finalize(ictx, &idev->obj);
+ *id = idev->obj.id;
+ return idev;
+
+out_release_owner:
+ iommu_device_release_dma_owner(dev);
+out_group_put:
+ iommufd_put_group(igroup);
+ return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_bind_noiommu, "IOMMUFD");
+
/**
* iommufd_ctx_has_group - True if any device within the group is bound
* to the ictx
@@ -366,6 +422,9 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
struct iommufd_group *igroup = idev->igroup;
int rc;
+ if (idev->no_iommu)
+ return 0;
+
lockdep_assert_held(&igroup->lock);
rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt,
@@ -432,6 +491,9 @@ static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
struct iommufd_attach_handle *handle;
int rc;
+ if (idev->no_iommu)
+ return 0;
+
rc = iommufd_hwpt_pasid_compat(hwpt, idev, pasid);
if (rc)
return rc;
@@ -486,6 +548,9 @@ static void iommufd_hwpt_detach_device(struct iommufd_hw_pagetable *hwpt,
{
struct iommufd_attach_handle *handle;
+ if (idev->no_iommu)
+ return;
+
handle = iommufd_device_get_attach_handle(idev, pasid);
if (pasid == IOMMU_NO_PASID)
iommu_detach_group_handle(hwpt->domain, idev->igroup->group);
@@ -507,6 +572,9 @@ static int iommufd_hwpt_replace_device(struct iommufd_device *idev,
struct iommufd_attach_handle *handle, *old_handle;
int rc;
+ if (idev->no_iommu)
+ return 0;
+
rc = iommufd_hwpt_pasid_compat(hwpt, idev, pasid);
if (rc)
return rc;
@@ -560,6 +628,9 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
struct iommufd_attach *attach;
int rc;
+ if (hwpt_paging->no_iommu != idev->no_iommu)
+ return -EOPNOTSUPP;
+
mutex_lock(&igroup->lock);
attach = xa_cmpxchg(&igroup->pasid_attach, pasid, NULL,
@@ -703,6 +774,9 @@ iommufd_group_do_replace_reserved_iova(struct iommufd_group *igroup,
unsigned long index;
int rc;
+ if (hwpt_paging->no_iommu)
+ return 0;
+
lockdep_assert_held(&igroup->lock);
attach = xa_load(&igroup->pasid_attach, IOMMU_NO_PASID);
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 487779470261a7..58102df9cdfeb0 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -145,7 +145,16 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
hwpt_paging->ioas = ioas;
hwpt_paging->nest_parent = flags & IOMMU_HWPT_ALLOC_NEST_PARENT;
- if (ops->domain_alloc_paging_flags) {
+ if (idev->no_iommu) {
+ hwpt->domain = iommufd_create_no_iommu_domain();
+ if (IS_ERR(hwpt->domain)) {
+ rc = PTR_ERR(hwpt->domain);
+ hwpt->domain = NULL;
+ goto out_abort;
+ }
+ immediate_attach = false;
+ hwpt_paging->no_iommu = true;
+ } else if (ops->domain_alloc_paging_flags) {
hwpt->domain = ops->domain_alloc_paging_flags(idev->dev,
flags & ~IOMMU_HWPT_FAULT_ID_VALID, user_data);
if (IS_ERR(hwpt->domain)) {
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 80e8c76d25f234..7a4a019ad7e5ab 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -308,6 +308,7 @@ struct iommufd_hwpt_paging {
bool auto_domain : 1;
bool enforce_cache_coherency : 1;
bool nest_parent : 1;
+ bool no_iommu : 1;
/* Head at iommufd_ioas::hwpt_list */
struct list_head hwpt_item;
struct iommufd_sw_msi_maps present_sw_msi;
@@ -425,6 +426,7 @@ struct iommufd_device {
/* always the physical device */
struct device *dev;
bool enforce_cache_coherency;
+ bool no_iommu;
/* protect iopf_enabled counter */
struct mutex iopf_lock;
unsigned int iopf_enabled;
diff --git a/drivers/iommu/iommufd/noiommu.c b/drivers/iommu/iommufd/noiommu.c
new file mode 100644
index 00000000000000..6b419a37f0713d
--- /dev/null
+++ b/drivers/iommu/iommufd/noiommu.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2025, NVIDIA CORPORATION & AFFILIATES.
+ *
+ */
+#include <linux/iommu.h>
+#include <linux/slab.h>
+#include <linux/xarray.h>
+
+/* FIXME s/mock/noiommu/ */
+
+struct mock_iommu_domain {
+ unsigned long flags;
+ struct iommu_domain domain;
+ struct xarray pfns;
+};
+
+enum {
+ MOCK_DIRTY_TRACK = 1,
+ MOCK_HUGE_PAGE_SIZE = 512 * PAGE_SIZE,
+
+ /*
+ * Like a real page table, alignment requires the low bits of the address
+ * to be zero. xarray also requires the high bit to be zero, so we store
+ * the pfns shifted. The upper bits are used for metadata.
+ */
+ MOCK_PFN_MASK = ULONG_MAX / PAGE_SIZE,
+
+ _MOCK_PFN_START = MOCK_PFN_MASK + 1,
+ MOCK_PFN_START_IOVA = _MOCK_PFN_START,
+ MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+ MOCK_PFN_DIRTY_IOVA = _MOCK_PFN_START << 1,
+ MOCK_PFN_HUGE_IOVA = _MOCK_PFN_START << 2,
+};
+
+static inline struct mock_iommu_domain *
+to_mock_domain(struct iommu_domain *domain)
+{
+ return container_of(domain, struct mock_iommu_domain, domain);
+}
+
+static int mock_domain_map_pages(struct iommu_domain *domain,
+ unsigned long iova, phys_addr_t paddr,
+ size_t pgsize, size_t pgcount, int prot,
+ gfp_t gfp, size_t *mapped)
+{
+ struct mock_iommu_domain *mock = to_mock_domain(domain);
+ unsigned long flags = MOCK_PFN_START_IOVA;
+ unsigned long start_iova = iova;
+
+ WARN_ON(iova % PAGE_SIZE);
+ WARN_ON(pgsize % PAGE_SIZE);
+ for (; pgcount; pgcount--) {
+ size_t cur;
+
+ for (cur = 0; cur != pgsize; cur += PAGE_SIZE) {
+ void *old;
+
+ if (pgcount == 1 && cur + PAGE_SIZE == pgsize)
+ flags = MOCK_PFN_LAST_IOVA;
+ if (pgsize != PAGE_SIZE) {
+ flags |= MOCK_PFN_HUGE_IOVA;
+ }
+ old = xa_store(&mock->pfns, iova / PAGE_SIZE,
+ xa_mk_value((paddr / PAGE_SIZE) |
+ flags),
+ gfp);
+ if (xa_is_err(old)) {
+ for (; start_iova != iova;
+ start_iova += PAGE_SIZE)
+ xa_erase(&mock->pfns,
+ start_iova /
+ PAGE_SIZE);
+ return xa_err(old);
+ }
+ WARN_ON(old);
+ iova += PAGE_SIZE;
+ paddr += PAGE_SIZE;
+ *mapped += PAGE_SIZE;
+ flags = 0;
+ }
+ }
+ return 0;
+}
+
+static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
+ unsigned long iova, size_t pgsize,
+ size_t pgcount,
+ struct iommu_iotlb_gather *iotlb_gather)
+{
+ struct mock_iommu_domain *mock = to_mock_domain(domain);
+ bool first = true;
+ size_t ret = 0;
+ void *ent;
+
+ WARN_ON(iova % PAGE_SIZE);
+ WARN_ON(pgsize % PAGE_SIZE);
+
+ for (; pgcount; pgcount--) {
+ size_t cur;
+
+ for (cur = 0; cur != pgsize; cur += PAGE_SIZE) {
+ ent = xa_erase(&mock->pfns, iova / PAGE_SIZE);
+
+ /*
+ * iommufd generates unmaps that must be a strict
+ * superset of the maps performed. So every
+ * starting/ending IOVA should have been an IOVA passed
+ * to map.
+ *
+ * This simple logic doesn't work when the HUGE_PAGE is
+ * turned on since the core code will automatically
+ * switch between the two page sizes creating a break in
+ * the unmap calls. The break can land in the middle of
+ * contiguous IOVA.
+ */
+ if (!(domain->pgsize_bitmap & MOCK_HUGE_PAGE_SIZE)) {
+ if (first) {
+ WARN_ON(ent && !(xa_to_value(ent) &
+ MOCK_PFN_START_IOVA));
+ first = false;
+ }
+ if (pgcount == 1 &&
+ cur + PAGE_SIZE == pgsize)
+ WARN_ON(ent && !(xa_to_value(ent) &
+ MOCK_PFN_LAST_IOVA));
+ }
+
+ iova += PAGE_SIZE;
+ ret += PAGE_SIZE;
+ }
+ }
+ return ret;
+}
+
+static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
+ dma_addr_t iova)
+{
+ struct mock_iommu_domain *mock = to_mock_domain(domain);
+ void *ent;
+
+ WARN_ON(iova % PAGE_SIZE);
+ ent = xa_load(&mock->pfns, iova / PAGE_SIZE);
+ WARN_ON(!ent);
+ return (xa_to_value(ent) & MOCK_PFN_MASK) * PAGE_SIZE;
+}
+
+static void mock_domain_free(struct iommu_domain *domain)
+{
+ struct mock_iommu_domain *mock = to_mock_domain(domain);
+
+ WARN_ON(!xa_empty(&mock->pfns));
+ kfree(mock);
+}
+
+static const struct iommu_domain_ops mock_domain_ops = {
+ .free = mock_domain_free,
+ .map_pages = mock_domain_map_pages,
+ .unmap_pages = mock_domain_unmap_pages,
+ .iova_to_phys = mock_domain_iova_to_phys,
+};
+
+static const struct iommu_ops no_iommu_owner;
+
+struct iommu_domain *iommufd_create_no_iommu_domain(void)
+{
+ struct mock_iommu_domain *mock;
+
+ mock = kzalloc(sizeof(*mock), GFP_KERNEL);
+ if (!mock)
+ return ERR_PTR(-ENOMEM);
+
+ mock->domain.geometry.aperture_start = 0;
+ mock->domain.geometry.aperture_end = ULONG_MAX;
+ mock->domain.pgsize_bitmap = PAGE_SIZE;
+ mock->domain.ops = &mock_domain_ops;
+ mock->domain.type = IOMMU_DOMAIN_UNMANAGED;
+ xa_init(&mock->pfns);
+ mock->domain.owner = &no_iommu_owner;
+
+ return &mock->domain;
+}