Message-ID: <4153fb7131dda901b13a2e90654232fe059c8f09.1749494161.git.nicolinc@nvidia.com>
Date: Mon, 9 Jun 2025 11:45:13 -0700
From: Nicolin Chen <nicolinc@...dia.com>
To: <jgg@...dia.com>, <joro@...tes.org>, <will@...nel.org>,
<robin.murphy@....com>, <bhelgaas@...gle.com>
CC: <iommu@...ts.linux.dev>, <linux-kernel@...r.kernel.org>,
<linux-pci@...r.kernel.org>, <patches@...ts.linux.dev>,
<pjaroszynski@...dia.com>, <vsethi@...dia.com>
Subject: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
before initiating a Function Level Reset, and then ensure that no invalidation
requests are issued to the device while its ATS capability is disabled.

Since pci_enable_ats() and pci_disable_ats() are called by an IOMMU driver,
while an unsolicited FLR can happen at any time in the PCI layer, the PCI code
has to notify the IOMMU subsystem about the ongoing FLR. Add a pair of new
IOMMU APIs that the PCI reset functions will call before and after the reset
routines.
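
For context, a minimal sketch of the intended PCI-side usage, assuming a
hypothetical wrapper around an existing reset helper (the name
pci_dev_reset_with_iommu() is illustrative, not part of this patch):

	static int pci_dev_reset_with_iommu(struct pci_dev *pdev)
	{
		int ret;

		/* Dock the device at the blocked domain; ATS gets disabled */
		ret = iommu_dev_reset_prepare(&pdev->dev);
		if (ret)
			return ret;

		/* The core-level reset routine, e.g. an FLR */
		ret = pcie_flr(pdev);

		/* Re-attach the retained domains and release group->mutex */
		iommu_dev_reset_done(&pdev->dev);
		return ret;
	}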
However, if a domain attachment/replacement happens during an ongoing reset,
ATS might be re-enabled between the two function calls. So
iommu_dev_reset_prepare() has to hold the group mutex to avoid this race
condition, until iommu_dev_reset_done() releases it. Thus, these two functions
are a strong pair that must be used together.
Inside the mutex, these two functions dock all RID and PASID domains to an
IOMMU_DOMAIN_BLOCKED. This disables ATS in two ways: the IOMMU driver should
disable ATS in its control bits (e.g. SMMU's STE.EATS), and it should call
pci_disable_ats() as well.
Notes:
- This only works for IOMMU drivers that implement ops->blocked_domain
  correctly, i.e. disable ATS via pci_disable_ats() when attaching it.
- This only works for IOMMU drivers that will not issue ATS invalidation
  requests to the device after it is docked at ops->blocked_domain.

Drivers should be fixed to align with the notes above; a sketch of a
conforming attach path follows.
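
For illustration only, here is what a conforming driver's blocked-domain
attach callback could look like. The mydrv_* names, the master structure,
and its ats_enabled field are hypothetical, not taken from this patch or
any existing driver:

	static int mydrv_attach_blocked(struct iommu_domain *domain,
					struct device *dev)
	{
		struct mydrv_master *master = dev_iommu_priv_get(dev);

		/* Clear the driver-side ATS enable bit (e.g. STE.EATS) */
		mydrv_write_blocked_ste(master);

		/* Then disable ATS on the PCI function itself */
		if (dev_is_pci(dev) && master->ats_enabled) {
			pci_disable_ats(to_pci_dev(dev));
			master->ats_enabled = false;
		}
		return 0;
	}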
Signed-off-by: Nicolin Chen <nicolinc@...dia.com>
---
include/linux/iommu.h | 12 +++++
drivers/iommu/iommu.c | 108 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 120 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 156732807994..a17161b8625a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1123,6 +1123,9 @@ void dev_iommu_priv_set(struct device *dev, void *priv);
extern struct mutex iommu_probe_device_lock;
int iommu_probe_device(struct device *dev);
+int iommu_dev_reset_prepare(struct device *dev);
+void iommu_dev_reset_done(struct device *dev);
+
int iommu_device_use_default_domain(struct device *dev);
void iommu_device_unuse_default_domain(struct device *dev);
@@ -1407,6 +1410,15 @@ static inline int iommu_fwspec_add_ids(struct device *dev, u32 *ids,
return -ENODEV;
}
+static inline int iommu_dev_reset_prepare(struct device *dev)
+{
+ return 0;
+}
+
+static inline void iommu_dev_reset_done(struct device *dev)
+{
+}
+
static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev)
{
return NULL;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a4b606c591da..3c1854c5e55e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3795,6 +3795,114 @@ int iommu_replace_group_handle(struct iommu_group *group,
}
EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
+/*
+ * Deadlock Alert
+ *
+ * The caller must use iommu_dev_reset_prepare() and iommu_dev_reset_done()
+ * together before/after the core-level reset routine, as the former holds
+ * group->mutex, which is only released by the latter.
+ */
+int iommu_dev_reset_prepare(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+ const struct iommu_ops *ops;
+ unsigned long pasid;
+ void *entry;
+ int ret;
+
+ /* Before locking */
+ if (!dev_has_iommu(dev))
+ return 0;
+
+ if (dev->iommu->require_direct) {
+ dev_warn(dev,
+ "Firmware has requested this device have a 1:1 IOMMU mapping, rejecting a reset that would dock it at a blocked (non-1:1) domain. Contact your platform vendor.\n");
+ return -EINVAL;
+ }
+
+ ops = dev_iommu_ops(dev);
+ if (!ops->blocked_domain) {
+ dev_warn(dev,
+ "IOMMU driver doesn't support IOMMU_DOMAIN_BLOCKED\n");
+ return -EOPNOTSUPP;
+ }
+
+ /*
+ * group->mutex starts
+ *
+ * This function has to hold the group mutex until the reset is done, to
+ * prevent any RID or PASID domain attachment/replacement, which otherwise
+ * might re-enable ATS during the reset cycle.
+ */
+ mutex_lock(&group->mutex);
+
+ /* Already attached to the blocked_domain. Nothing to do but keep the mutex */
+ if (group->domain->type == IOMMU_DOMAIN_BLOCKED)
+ return 0;
+
+ /* Dock RID domain to blocked_domain while retaining group->domain */
+ ret = __iommu_attach_device(ops->blocked_domain, dev);
+ if (ret) {
+ mutex_unlock(&group->mutex);
+ return ret;
+ }
+
+ /* Dock PASID domains to blocked_domain while retaining pasid_array */
+ xa_lock(&group->pasid_array);
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+ iommu_remove_dev_pasid(dev, pasid,
+ pasid_array_entry_to_domain(entry));
+ xa_unlock(&group->pasid_array);
+
+ /* group->mutex is held. Caller must invoke iommu_dev_reset_done() */
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_prepare);
+
+/*
+ * This is the resume routine of iommu_dev_reset_prepare(). It unlocks the
+ * group mutex at the end, after all RID/PASID domains are re-attached.
+ *
+ * Note that, although unlikely, there is a risk that re-attaching a domain
+ * might fail due to an unexpected condition such as OOM.
+ */
+void iommu_dev_reset_done(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+ const struct iommu_ops *ops;
+ unsigned long pasid;
+ void *entry;
+
+ /* These cases never took group->mutex in iommu_dev_reset_prepare() */
+ if (!dev_has_iommu(dev))
+ return;
+ ops = dev_iommu_ops(dev);
+ if (!ops->blocked_domain)
+ return;
+
+ /* group->mutex held in iommu_dev_reset_prepare() continues from here */
+ lockdep_assert_held(&group->mutex);
+
+ if (group->domain->type == IOMMU_DOMAIN_BLOCKED)
+ goto unlock;
+
+ /* Shift RID domain back to group->domain */
+ WARN_ON(__iommu_attach_device(group->domain, dev));
+
+ /* Shift PASID domains back to domains retained in pasid_array */
+ xa_lock(&group->pasid_array);
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+ WARN_ON(__iommu_set_group_pasid(
+ pasid_array_entry_to_domain(entry), group, pasid,
+ ops->blocked_domain));
+ xa_unlock(&group->pasid_array);
+
+unlock:
+ mutex_unlock(&group->mutex);
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_done);
+
#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
/**
* iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
--
2.43.0