Message-ID: <20251119001151.GH120075@nvidia.com>
Date: Tue, 18 Nov 2025 20:11:51 -0400
From: Jason Gunthorpe <jgg@...dia.com>
To: Suravee Suthikulpanit <suravee.suthikulpanit@....com>
Cc: nicolinc@...dia.com, linux-kernel@...r.kernel.org, robin.murphy@....com,
will@...nel.org, joro@...tes.org, kevin.tian@...el.com,
jsnitsel@...hat.com, vasant.hegde@....com, iommu@...ts.linux.dev,
santosh.shukla@....com, sairaj.arunkodilkar@....com,
jon.grimm@....com, prashanthpra@...gle.com, wvw@...gle.com,
wnliu@...gle.com, gptran@...gle.com, kpsingh@...gle.com,
joao.m.martins@...cle.com, alejandro.j.jimenez@...cle.com
Subject: Re: [PATCH v5 11/14] iommu/amd: Introduce gDomID-to-hDomID Mapping
and handle parent domain invalidation
On Wed, Nov 12, 2025 at 06:25:03PM +0000, Suravee Suthikulpanit wrote:
> Each nested domain is assigned a guest domain ID (gDomID), which the
> guest OS programs into the guest Device Table Entry (gDTE). For each
> gDomID, the driver assigns a corresponding host domain ID (hDomID),
> which is programmed into the host Device Table Entry (hDTE).
>
> The hDomID is allocated during amd_iommu_alloc_domain_nested() and
> freed during nested_domain_free(). The gDomID-to-hDomID mapping info
> (struct guest_domain_mapping_info) is stored in a per-viommu xarray
> (struct amd_iommu_viommu.gdomid_array), which is indexed by gDomID.
>
> Note also that a parent domain can be shared among multiple struct
> iommufd_viommu instances. Therefore, when the hypervisor invalidates
> the nested parent domain, the AMD IOMMU command INVALIDATE_IOMMU_PAGES
> must be issued for each hDomID in the gdomid_array. This is handled by
> iommu_flush_pages_v1_hdom_ids(), which iterates through struct
> protection_domain.viommu_list.
>
> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@....com>
> ---
> drivers/iommu/amd/amd_iommu_types.h | 23 +++++++++
> drivers/iommu/amd/iommu.c           | 35 +++++++++++++
> drivers/iommu/amd/iommufd.c         | 34 ++++++++++++
> drivers/iommu/amd/nested.c          | 80 +++++++++++++++++++++++++++++
> 4 files changed, 172 insertions(+)
I think this looks OK in general; just the locking needs fixing up.
> +static int iommu_flush_pages_v1_hdom_ids(struct protection_domain *pdom, u64 address, size_t size)
> +{
> +        int ret = 0;
> +        struct amd_iommu_viommu *aviommu;
> +
> +        list_for_each_entry(aviommu, &pdom->viommu_list, pdom_list) {
> +                unsigned long i;
> +                struct guest_domain_mapping_info *gdom_info;
> +                struct amd_iommu *iommu = container_of(aviommu->core.iommu_dev, struct amd_iommu, iommu);
> +
> +                xa_for_each(&aviommu->gdomid_array, i, gdom_info) {
> +                        struct iommu_cmd cmd;
What is the locking for the xa here? It looks to be missing too.
Either hold the xa lock for this iteration, or do something to hold
the pdom lock when updating the xarray.
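
Something like this, maybe (completely untested sketch; it assumes
nothing else already serializes gdomid_array updates against this
walk):

        xa_lock(&aviommu->gdomid_array);
        xa_for_each(&aviommu->gdomid_array, i, gdom_info) {
                struct iommu_cmd cmd;

                /* Under the xa_lock the entry cannot be erased and
                 * freed while we are reading hdom_id from it. */
                build_inv_iommu_pages(&cmd, address, size,
                                      gdom_info->hdom_id,
                                      IOMMU_NO_PASID, false);
                ret |= iommu_queue_command(iommu, &cmd);
        }
        xa_unlock(&aviommu->gdomid_array);

iommu_queue_command() only takes the command queue lock internally,
so calling it under the xa_lock should be fine, but double check
that.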
> +                        pr_debug("%s: iommu=%#x, hdom_id=%#x\n", __func__,
> +                                 iommu->devid, gdom_info->hdom_id);
> +                        build_inv_iommu_pages(&cmd, address, size, gdom_info->hdom_id,
> +                                              IOMMU_NO_PASID, false);
> +                        ret |= iommu_queue_command(iommu, &cmd);
> +                }
> +        }
> +        return ret;
> +}
This is kind of painfully slow for invalidation, but OK for now; we
don't really expect a lot of parent invalidation traffic.
I think after we get this landed:
https://lore.kernel.org/r/cover.1762588839.git.nicolinc@nvidia.com
we should take a serious run at making it shared code and have AMD
use it. That would resolve the performance concern.
> +static void amd_iommufd_viommu_destroy(struct iommufd_viommu *viommu)
> +{
> +        unsigned long flags;
> +        struct amd_iommu_viommu *entry, *next;
> +        struct amd_iommu_viommu *aviommu = container_of(viommu, struct amd_iommu_viommu, core);
> +        struct protection_domain *pdom = aviommu->parent;
> +
> +        spin_lock_irqsave(&pdom->lock, flags);
> +        list_for_each_entry_safe(entry, next, &pdom->viommu_list, pdom_list) {
> +                if (entry == aviommu)
> +                        list_del(&entry->pdom_list);
> +        }
> +        spin_unlock_irqrestore(&pdom->lock, flags);
Same remark as Nicolin
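
ie, if I recall Nicolin's point correctly, there is no need to walk
the list to find an entry you already have the pointer to; presumably
just (untested):

        spin_lock_irqsave(&pdom->lock, flags);
        /* aviommu is known to be on pdom->viommu_list already */
        list_del(&aviommu->pdom_list);
        spin_unlock_irqrestore(&pdom->lock, flags);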
> static void nested_domain_free(struct iommu_domain *dom)
> {
> +        struct guest_domain_mapping_info *curr;
>          struct nested_domain *ndom = to_ndomain(dom);
> +        struct amd_iommu_viommu *aviommu = ndom->viommu;
> +
> +        if (!refcount_dec_and_test(&ndom->gdom_info->users))
> +                return;
Same locking issue here: you have to hold the xa_lock from this
decrement through to the cmpxchg, otherwise it will go wrong.
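
Roughly (untested; __xa_cmpxchg() is the variant that expects the
caller to already hold the xa_lock):

        xa_lock(&aviommu->gdomid_array);
        if (!refcount_dec_and_test(&ndom->gdom_info->users)) {
                xa_unlock(&aviommu->gdomid_array);
                return;
        }
        /* The lock is held from the 0 transition through the erase,
         * so no concurrent lookup can take a new reference. */
        curr = __xa_cmpxchg(&aviommu->gdomid_array, ndom->gdom_id,
                            ndom->gdom_info, NULL, GFP_ATOMIC);
        xa_unlock(&aviommu->gdomid_array);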
> +        /*
> +         * The refcount for the gdom_id to hdom_id mapping is zero.
> +         * It is now safe to remove the mapping.
> +         */
> +        curr = xa_cmpxchg(&aviommu->gdomid_array, ndom->gdom_id,
> +                          ndom->gdom_info, NULL, GFP_ATOMIC);
> +        if (curr) {
> +                if (xa_err(curr)) {
> +                        pr_err("%s: Failed to free nested domain gdom_id=%#x\n",
> +                               __func__, ndom->gdom_id);
Don't log. This should be a WARN_ON; the cmpxchg cannot fail unless
the xa is corrupted.
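
ie just:

        /* cmpxchg failure here means the xarray is corrupted */
        if (WARN_ON(xa_err(curr)))
                return;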
> +                        return;
> +                }
> +
> +                /* success */
> +                pr_debug("%s: Free gdom_id=%#x, hdom_id=%#x\n",
> +                         __func__, ndom->gdom_id, curr->hdom_id);
> +                kfree(curr);
> +        }
> +
> +        amd_iommu_pdom_id_free(ndom->gdom_info->hdom_id);
?? This should be before the kfree(curr). The hdom_id remains
allocated and valid only as long as the gdom_info is allocated, and
curr and ndom->gdom_info are the same object here, so this read is a
use-after-free.
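
ie (untested):

        /* hdom_id is stored inside gdom_info, release it first */
        amd_iommu_pdom_id_free(curr->hdom_id);
        kfree(curr);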
Jason