Message-ID: <20250704154342.GN1410929@nvidia.com>
Date: Fri, 4 Jul 2025 12:43:42 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Nicolin Chen <nicolinc@...dia.com>
Cc: joro@...tes.org, will@...nel.org, robin.murphy@....com,
rafael@...nel.org, lenb@...nel.org, bhelgaas@...gle.com,
iommu@...ts.linux.dev, linux-kernel@...r.kernel.org,
linux-acpi@...r.kernel.org, linux-pci@...r.kernel.org,
patches@...ts.linux.dev, pjaroszynski@...dia.com, vsethi@...dia.com,
helgaas@...nel.org, baolu.lu@...ux.intel.com
Subject: Re: [PATCH RFC v2 3/4] iommu: Introduce iommu_dev_reset_prepare()
and iommu_dev_reset_done()
On Sat, Jun 28, 2025 at 12:42:41AM -0700, Nicolin Chen wrote:
> - This only works for IOMMU drivers that implemented ops->blocked_domain
> correctly with pci_disable_ats().
As was discussed in the thread, it works for everyone. Even if we
install an empty paging domain for blocking, that will still stop the
ATS invalidations from being issued. ATS remains on, but this is not a
problem.
> - This only works for IOMMU drivers that will not issue ATS invalidation
> requests to the device, after it's docked at ops->blocked_domain.
Which should be everyone. It would be broken and racy against release
to do otherwise.
> @@ -2155,8 +2172,17 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
> int ret = 0;
>
> mutex_lock(&group->mutex);
> +
> + /*
> + * There is a racy attach while the device is resetting. Defer it until
> + * the iommu_dev_reset_done() that attaches the device to group->domain.
> + */
> + if (device_to_group_device(dev)->pending_reset)
> + goto unlock;
> +
> if (dev->iommu && dev->iommu->attach_deferred)
> ret = __iommu_attach_device(domain, dev);
> +unlock:
> mutex_unlock(&group->mutex);
Actually, looking at this some more, maybe write it like:

	/*
	 * This is called on the DMA mapping fast path, so avoid taking the
	 * lock. This is racy, but we have an expectation that the driver
	 * will set up its DMAs inside probe while still single threaded to
	 * avoid racing.
	 */
	if (!dev->iommu || !READ_ONCE(dev->iommu->attach_deferred))
		return 0;

	guard(mutex)(&group->mutex);

	if (device_to_group_device(dev)->pending_reset)
		return 0;
	if (!dev->iommu->attach_deferred)
		return 0;
	return __iommu_attach_device(domain, dev);
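The lockless READ_ONCE() is just an opportunistic fast path; the
re-check of attach_deferred under the mutex is what keeps this correct
when it does race.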
And of course it is already quite crazy to be doing an FLR during
device probe, so this is not a realistic scenario.
> @@ -2295,6 +2321,13 @@ static int __iommu_device_set_domain(struct iommu_group *group,
> dev->iommu->attach_deferred = 0;
> }
>
> + /*
> + * There is a racy attach while the device is resetting. Defer it until
> + * the iommu_dev_reset_done() that attaches the device to group->domain.
"There is a concurrent attach" here and other places
> +int iommu_dev_reset_prepare(struct device *dev)
> +{
> + const struct iommu_ops *ops;
> + struct iommu_group *group;
> + unsigned long pasid;
> + void *entry;
> + int ret = 0;
> +
> + if (!dev_has_iommu(dev))
> + return 0;
> +
> + if (dev->iommu->require_direct) {
> + dev_warn(
> + dev,
> + "Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.\n");
> + return -EINVAL;
> + }
I don't think we can do this. E.g. on ARM all devices have RMRs inside
VMs, so this would completely break FLR inside a VM.

Either ignore this condition with the rationale that we are about to
reset the device so it doesn't matter, or establish a new paging
domain for isolation purposes that has the RMRs set up.
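Roughly, the latter could look like this untested sketch
(alloc_reset_isolation_domain() is a made-up name; it mirrors what
iommu_create_device_direct_mappings() already does on attach):

	/*
	 * Sketch: build a private paging domain containing only the
	 * firmware-required direct mappings (RMRs), so the device stays
	 * isolated across reset without violating require_direct.
	 */
	static struct iommu_domain *alloc_reset_isolation_domain(struct device *dev)
	{
		struct iommu_domain *domain = iommu_paging_domain_alloc(dev);
		struct iommu_resv_region *region;
		LIST_HEAD(resv_regions);
		int ret = 0;

		if (IS_ERR(domain))
			return domain;

		iommu_get_resv_regions(dev, &resv_regions);
		list_for_each_entry(region, &resv_regions, list) {
			if (region->type != IOMMU_RESV_DIRECT &&
			    region->type != IOMMU_RESV_DIRECT_RELAXABLE)
				continue;
			/* Identity-map the reserved range into the new domain */
			ret = iommu_map(domain, region->start, region->start,
					region->length, region->prot, GFP_KERNEL);
			if (ret)
				break;
		}
		iommu_put_resv_regions(dev, &resv_regions);

		if (ret) {
			iommu_domain_free(domain);
			return ERR_PTR(ret);
		}
		return domain;
	}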
> + /* group will be put in iommu_dev_reset_done() */
> + group = iommu_group_get(dev);
Probably don't need this. If we are already requiring no concurrent
iommu_release_device() then we can stick with that.
> + /* Caller ensures no racy iommu_release_device(), so this won't UAF */
> + mutex_lock(&group->mutex);
It is the iommu_group_get() above that prevents the UAF; this is fine
once we have a refcount.
> + ops = dev_iommu_ops(dev);
> + if (!ops->blocked_domain) {
> + dev_warn(dev,
> + "IOMMU driver doesn't support IOMMU_DOMAIN_BLOCKED\n");
> + ret = -EOPNOTSUPP;
> + goto unlock;
Not necessary; just use the existing flow that allocates an empty
paging domain into group->blocking_domain.
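Something like this, as an untested sketch (IIRC this is the same
helper the DMA ownership claim path relies on):

	ret = __iommu_group_alloc_blocking_domain(group);
	if (ret)
		goto unlock;

It uses ops->blocked_domain when the driver provides one and otherwise
falls back to an empty paging domain in group->blocking_domain, so the
-EOPNOTSUPP case goes away entirely.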
> + device_to_group_device(dev)->pending_reset = true;
> +
> + /* Device is already attached to the blocked_domain. Nothing to do */
> + if (group->domain->type == IOMMU_DOMAIN_BLOCKED)
> + goto unlock;
Should be group->domain == group->blocking_domain
> + /* Dock RID domain to blocked_domain while retaining group->domain */
> + ret = __iommu_attach_device(ops->blocked_domain, dev);
group->blocking_domain
> + if (ret)
> + goto unlock;
> +
> + /* Dock PASID domains to blocked_domain while retaining pasid_array */
> + xa_lock(&group->pasid_array);
Not sure we need this lock? The group mutex already prevents mutation
of the xarray, and I don't think it is allowed to call
iommu_remove_dev_pasid() in an atomic context.
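So a plain iteration under the mutex should do, e.g. (sketch, reusing
the pasid/entry locals from the patch):

	lockdep_assert_held(&group->mutex);

	/* No xa_lock: group->mutex excludes concurrent PASID attach/detach */
	xa_for_each(&group->pasid_array, pasid, entry) {
		/* dock this PASID domain to the blocked domain, as before */
	}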
Jason