linux-kernel - Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aFvDqAGAM3RbPh8G@yilunxu-OptiPlex-7050>
Date: Wed, 25 Jun 2025 17:38:48 +0800
From: Xu Yilun <yilun.xu@...ux.intel.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@...nel.org>
Cc: jgg@...dia.com, jgg@...pe.ca, kevin.tian@...el.com, will@...nel.org,
	iommu@...ts.linux.dev, linux-kernel@...r.kernel.org,
	joro@...tes.org, robin.murphy@....com, shuah@...nel.org,
	nicolinc@...dia.com, aik@....com, dan.j.williams@...el.com,
	baolu.lu@...ux.intel.com, yilun.xu@...el.com
Subject: Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy

On Wed, Jun 25, 2025 at 12:10:28PM +0530, Aneesh Kumar K.V wrote:
> Xu Yilun <yilun.xu@...ux.intel.com> writes:
> 
> > Destroy iommufd_vdevice(vdev) on iommufd_idevice(idev) destroy so that
> > vdev can't outlive idev.
> >
> > iommufd_device(idev) represents the physical device bound to iommufd,
> > while the iommufd_vdevice(vdev) represents the virtual instance of the
> > physical device in the VM. The lifecycle of the vdev should not be
> > longer than idev. This doesn't cause real problem on existing use cases
> > cause vdev doesn't impact the physical device, only provides
> > virtualization information. But to extend vdev for Confidential
> > Computing(CC), there are needs to do secure configuration for the vdev,
> > e.g. TSM Bind/Unbind. These configurations should be rolled back on idev
> > destroy, or the external driver(VFIO) functionality may be impact.
> >
> > Building the association between idev & vdev requires the two objects
> > pointing each other, but not referencing each other. This requires
> > proper locking. This is done by reviving some of Nicolin's patch [1].
> >
> > There are 3 cases on idev destroy:
> >
> >   1. vdev is already destroyed by userspace. No extra handling needed.
> >   2. vdev is still alive. Use iommufd_object_tombstone_user() to
> >      destroy vdev and tombstone the vdev ID.
> >   3. vdev is being destroyed by userspace. The vdev ID is already freed,
> >      but vdev destroy handler is not complete. The idev destroy handler
> >      should wait for vdev destroy completion.
> >
> > [1]: https://lore.kernel.org/all/53025c827c44d68edb6469bfd940a8e8bc6147a5.1729897278.git.nicolinc@nvidia.com/
> >
> > Original-by: Nicolin Chen <nicolinc@...dia.com>
> > Original-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@...nel.org>
> > Signed-off-by: Xu Yilun <yilun.xu@...ux.intel.com>
> 
> This is the latest version I have. But as Jason suggested we can
> possibly switch to short term users to avoid parallel destroy returning
> EBUSY.

We can discuss in that thread, I have the same question as Kevin.

> I am using a mutex lock to block parallel vdevice destroy.

I don't find reason to add a new lock rather than reuse the
idev->igroup->lock, which is already reviewed in Nicolin's series.

[...]

> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 3df468f64e7d..fd82140e6320 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -81,14 +81,16 @@ void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx,
>  struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
>  					  enum iommufd_object_type type)
>  {
> +	XA_STATE(xas, &ictx->objects, id);
>  	struct iommufd_object *obj;
>  
>  	if (iommufd_should_fail())
>  		return ERR_PTR(-ENOENT);
>  
>  	xa_lock(&ictx->objects);
> -	obj = xa_load(&ictx->objects, id);
> -	if (!obj || (type != IOMMUFD_OBJ_ANY && obj->type != type) ||
> +	obj = xas_load(&xas);
> +	if (!obj || xa_is_zero(obj) ||

xas_load() & filter out zero entries, then what's the difference from
xa_load()?

> +	    (type != IOMMUFD_OBJ_ANY && obj->type != type) ||
>  	    !iommufd_lock_obj(obj))
>  		obj = ERR_PTR(-ENOENT);
>  	xa_unlock(&ictx->objects);

[...]

>  static int iommufd_destroy(struct iommufd_ucmd *ucmd)
>  {
> +	int ret;
>  	struct iommu_destroy *cmd = ucmd->cmd;
> +	struct iommufd_object *obj;
> +	struct iommufd_device *idev = NULL;
> +
> +	obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY);
> +	/* Destroying vdevice requires an idevice lock */
> +	if (!IS_ERR(obj) && obj->type == IOMMUFD_OBJ_VDEVICE) {
> +		struct iommufd_vdevice *vdev =
> +			container_of(obj, struct iommufd_vdevice, obj);
> +		/*
> +		 * since vdev have an refcount on idev, this is safe.
> +		 */
> +		idev = vdev->idev;
> +		mutex_lock(&idev->lock);
> +		/* drop the additonal reference taken above */
> +		iommufd_put_object(ucmd->ictx, obj);
> +	}
> +
> +	ret = iommufd_object_remove(ucmd->ictx, NULL, cmd->id, 0);
> +	if (idev)
> +		mutex_unlock(&idev->lock);

I'm trying best not to add vdev/idev specific things to generic
code. I also don't prefer adding a idev specific lock around generic
object remove flow. That makes idev/vdev way to special. So for these
locking things, I prefer more to Nicolin's v5 and revives them.

>  
> -	return iommufd_object_remove(ucmd->ictx, NULL, cmd->id, 0);
> +	return ret;
>  }
>  

[...]

> @@ -147,10 +160,17 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd)
>  	if (rc)
>  		goto out_abort;
>  	iommufd_object_finalize(ucmd->ictx, &vdev->obj);
> -	goto out_put_idev;
> +	/* don't allow idev free without vdev free */
> +	refcount_inc(&idev->obj.users);

IIRC, it has been suggested no long term refcount to kernel created
object. Besides, you actually disallow nothing.

> +	vdev->idev = idev;
> +	/* vdev lifecycle now managed by idev */
> +	idev->vdev = vdev;
> +	goto out_put_idev_unlock;
>  
>  out_abort:
>  	iommufd_object_abort_and_destroy(ucmd->ictx, &vdev->obj);
> +out_put_idev_unlock:
> +	mutex_unlock(&idev->lock);
>  out_put_idev:
>  	iommufd_put_object(ucmd->ictx, &idev->obj);
>  out_put_viommu:
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 6328c3a05bcd..0bf4f8b7f8d2 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -694,6 +694,12 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>  #if IS_ENABLED(CONFIG_EEH)
>  	eeh_dev_release(vdev->pdev);
>  #endif
> +
> +	/* destroy vdevice which involves tsm unbind before we disable pci disable */
> +	if (core_vdev->iommufd_device)
> +		iommufd_device_tombstone_vdevice(core_vdev->iommufd_device);

Ah.. I think all our effort to destroy vdevice on idevice destruction
is to hide the sequence details from VFIO.

> +
> +	/* tsm unbind should happen before this */
>  	vfio_pci_core_disable(vdev);

I did mention there is still issue when vdevice lifecycle problem is
solved. That VFIO for now do device disable configurations before
iommufd_device_unbind() and cause problem. But my quick idea would
be (not tested):

@@ -544,12 +544,14 @@ static void vfio_df_device_last_close(struct vfio_device_file *df)

        lockdep_assert_held(&device->dev_set->lock);

-       if (device->ops->close_device)
-               device->ops->close_device(device);
        if (iommufd)
                vfio_df_iommufd_unbind(df);
        else
                vfio_device_group_unuse_iommu(device);
+
+       if (device->ops->close_device)
+               device->ops->close_device(device);

>  
>  	mutex_lock(&vdev->igate);
> diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
> index 34b6e6ca4bfa..8de3d903bfc0 100644
> --- a/include/linux/iommufd.h
> +++ b/include/linux/iommufd.h
> @@ -63,6 +63,7 @@ void iommufd_device_detach(struct iommufd_device *idev, ioasid_t pasid);
>  
>  struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev);
>  u32 iommufd_device_to_id(struct iommufd_device *idev);
> +void iommufd_device_tombstone_vdevice(struct iommufd_device *idev);
>  
>  struct iommufd_access_ops {
>  	u8 needs_pin_pages : 1;