linux-kernel - Re: [PATCH RFC v2 3/4] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset

Message-ID: <aIAJfYMKYKyZZRqx@Asurada-Nvidia>
Date: Tue, 22 Jul 2025 14:58:21 -0700
From: Nicolin Chen <nicolinc@...dia.com>
To: Jason Gunthorpe <jgg@...dia.com>
CC: <joro@...tes.org>, <will@...nel.org>, <robin.murphy@....com>,
	<rafael@...nel.org>, <lenb@...nel.org>, <bhelgaas@...gle.com>,
	<iommu@...ts.linux.dev>, <linux-kernel@...r.kernel.org>,
	<linux-acpi@...r.kernel.org>, <linux-pci@...r.kernel.org>,
	<patches@...ts.linux.dev>, <pjaroszynski@...dia.com>, <vsethi@...dia.com>,
	<helgaas@...nel.org>, <baolu.lu@...ux.intel.com>
Subject: Re: [PATCH RFC v2 3/4] iommu: Introduce iommu_dev_reset_prepare()
 and iommu_dev_reset_done()

Sorry for the long delay. I've addressed all of your remarks.

Some feedback inline.

On Fri, Jul 04, 2025 at 12:43:42PM -0300, Jason Gunthorpe wrote:
> On Sat, Jun 28, 2025 at 12:42:41AM -0700, Nicolin Chen wrote:
> 
> >  - This only works for IOMMU drivers that implemented ops->blocked_domain
> >    correctly with pci_disable_ats().
> 
> As was in the thread, it works for everyone. Even if we install an
> empty paging domain for blocking that still will stop the ATS
> invalidations from being issued. ATS remains on but this is not a
> problem.

OK. And I am dropping this validation in the PCI patch:

	/* Something wrong with the iommu driver that failed to disable ATS */
	if (dev->ats_enabled)
		pci_err(dev, "failed to stop ATS. ATS invalidation may time out\n");

> > @@ -2155,8 +2172,17 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
> >  	int ret = 0;
> >  
> >  	mutex_lock(&group->mutex);
> > +
> > +	/*
> > +	 * There is a racy attach while the device is resetting. Defer it until
> > +	 * the iommu_dev_reset_done() that attaches the device to group->domain.
> > +	 */
> > +	if (device_to_group_device(dev)->pending_reset)
> > +		goto unlock;
> > +
> >  	if (dev->iommu && dev->iommu->attach_deferred)
> >  		ret = __iommu_attach_device(domain, dev);
> > +unlock:
> >  	mutex_unlock(&group->mutex);
> 
> Actually looking at this some more maybe write it like:
> 
> /*
>  * This is called on the dma mapping fast path so avoid locking. This
>  * is racy, but we have an expectation that the driver will setup its
>  * DMAs inside probe while still single threaded to avoid racing.
>  */
> if (dev->iommu && !READ_ONCE(dev->iommu->attach_deferred))

This triggers a build error as attach_deferred is a bit-field. So I
am changing it from "u32 attach_deferred:1" to "bool" for this.
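
For reference, a minimal userspace sketch of why the bit-field fails to
build: READ_ONCE() takes the address of its argument, and C does not allow
taking the address of a bit-field. READ_ONCE_SKETCH() below is a simplified
stand-in for the kernel macro, and struct dev_iommu_sketch is a hypothetical
cut-down structure for illustration, not the real struct dev_iommu.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's READ_ONCE(): a volatile load
 * through a pointer to the argument (uses the GNU __typeof__ extension).
 */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical cut-down structure, for illustration only. */
struct dev_iommu_sketch {
	/*
	 * With "unsigned int attach_deferred:1;" the macro above would not
	 * compile, because &bit_field is not valid C.
	 */
	bool attach_deferred;
};

/* Returns true only when the parameter block exists and an attach is deferred. */
static bool needs_deferred_attach(struct dev_iommu_sketch *param)
{
	return param && READ_ONCE_SKETCH(param->attach_deferred);
}
```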

And, to keep the original logic, I think it should be:
	if (!dev->iommu || !READ_ONCE(dev->iommu->attach_deferred))

>    return 0;
> 
> guard(mutex)(&group->mutex);

I recall Baolu mentioned that Joerg might not like the guard style
so I am keeping mutex_lock/unlock().

> if (device_to_group_device(dev)->pending_reset)
>     return 0;
> 
> if (!dev->iommu->attach_deferred)
>    return 0;

I think this is redundant since the fast path has already checked it.
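
Putting the points above together, the resulting flow might look roughly
like the userspace sketch below. It is only an illustration under stated
assumptions: struct dev_sketch, attach_device_sketch() and
READ_ONCE_SKETCH() are hypothetical stand-ins (a pthread mutex plays the
role of group->mutex), not the actual patch.

```c
#include <pthread.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's READ_ONCE() (GNU __typeof__). */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical flattened view of the state the real function looks at. */
struct dev_sketch {
	bool has_iommu;               /* stands in for dev->iommu != NULL */
	bool attach_deferred;         /* dev->iommu->attach_deferred as a bool */
	bool pending_reset;           /* device_to_group_device(dev)->pending_reset */
	pthread_mutex_t *group_mutex; /* stands in for group->mutex */
	int attach_calls;             /* counts calls to the stand-in attach */
};

/* Stand-in for __iommu_attach_device(); it only records that it ran. */
static int attach_device_sketch(struct dev_sketch *dev)
{
	dev->attach_calls++;
	return 0;
}

static int deferred_attach_sketch(struct dev_sketch *dev)
{
	int ret = 0;

	/* Lockless fast path: nothing to do unless an attach was deferred. */
	if (!dev->has_iommu || !READ_ONCE_SKETCH(dev->attach_deferred))
		return 0;

	pthread_mutex_lock(dev->group_mutex);
	/*
	 * Skip the attach while a reset is pending; in the quoted patch,
	 * iommu_dev_reset_done() attaches the device to group->domain later.
	 */
	if (!dev->pending_reset)
		ret = attach_device_sketch(dev);
	pthread_mutex_unlock(dev->group_mutex);

	return ret;
}
```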

> return __iommu_attach_device(domain, dev);
> 
> And of course it is already quite crazy to be doing FLR during a
> device probe so this is not a realistic scenario.

Hmm, I am not sure about that, as I see iommu_deferred_attach() is
mostly invoked from a dma_alloc() or even a dma_map() path. So, this
might not be confined to device probe?

> > +	if (dev->iommu->require_direct) {
> > +		dev_warn(
> > +			dev,
> > +			"Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.\n");
> > +		return -EINVAL;
> > +	}
> 
> I don't think we can do this. eg on ARM all devices have RMRs inside
> VMs so this will completely break FLR inside a vm???
> 
> Either ignore this condition with the rational that we are about to
> reset it so it doesn't matter, or we need to establish a new paging
> domain for isolation purposes that has the RMR setup.

Ah, you are right. ARM MSI in a VM uses RMR and sets this.

But doesn't that also raise the question of whether a VM with RMRs can
use the blocked_domain at all, since __iommu_device_set_domain() has
the exact same check rejecting blocked_domain? Not sure if there would
be some unintended consequence though...

> > +	if (ret)
> > +		goto unlock;
> > +
> > +	/* Dock PASID domains to blocked_domain while retaining pasid_array */
> > +	xa_lock(&group->pasid_array);
> 
> Not sure we need this lock? The group mutex already prevents mutation
> of the xa list and I dont' think it is allowed to call
> iommu_remove_dev_pasid() in an atomic context.

As far as I can see, only iommu_attach_handle_get() doesn't take
group->mutex, and it's a reader. So I think it's safe to drop the xa_lock.

I added this:

	/*
	 * Dock PASID domains to blocking_domain while retaining pasid_array.
	 *
	 * The pasid_array is mostly fenced by group->mutex, except one reader
	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
	 */

Thanks!
Nicolin