linux-kernel - Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy

Message-ID: <20250624145346.GC150753@nvidia.com>
Date: Tue, 24 Jun 2025 11:53:46 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Xu Yilun <yilun.xu@...ux.intel.com>
Cc: kevin.tian@...el.com, will@...nel.org, aneesh.kumar@...nel.org,
	iommu@...ts.linux.dev, linux-kernel@...r.kernel.org,
	joro@...tes.org, robin.murphy@....com, shuah@...nel.org,
	nicolinc@...dia.com, aik@....com, dan.j.williams@...el.com,
	baolu.lu@...ux.intel.com, yilun.xu@...el.com
Subject: Re: [PATCH v2 3/4] iommufd: Destroy vdevice on idevice destroy

On Mon, Jun 23, 2025 at 05:49:45PM +0800, Xu Yilun wrote:
> +static void iommufd_device_remove_vdev(struct iommufd_device *idev)
> +{
> +	bool vdev_removing = false;
> +
> +	mutex_lock(&idev->igroup->lock);
> +	if (idev->vdev) {
> +		struct iommufd_vdevice *vdev;
> +
> +		vdev = iommufd_get_vdevice(idev->ictx, idev->vdev->obj.id);
> +		if (IS_ERR(vdev)) {

This increments obj.users, which will cause a concurrent
iommufd_object_remove() to fail with -EBUSY, which we are trying to
avoid.

Also you can hit a race where the tombstone has NULL'd the entry but
the racing destroy will then load the NULL with xas_load() and hit this:

		if (WARN_ON(obj != to_destroy)) {

So, this doesn't look like it will work right to me..
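
To make the -EBUSY point concrete, here is a minimal user-space sketch. It is
not kernel code: struct model_obj and every model_* function are hypothetical
stand-ins built on C11 atomics, and model_dec_if_one() only imitates what
refcount_dec_if_one() does. The assumption is that the destroy path succeeds
only when users is exactly 1, so an extra users reference makes a concurrent
destroy fail, while an extra shortterm_users reference does not.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the two counters in an iommufd object. */
struct model_obj {
	atomic_uint users;           /* long-term references */
	atomic_uint shortterm_users; /* transient references */
};

/* Imitates refcount_dec_if_one(): go 1 -> 0 only if the count is exactly 1. */
static bool model_dec_if_one(atomic_uint *r)
{
	unsigned int expected = 1;

	return atomic_compare_exchange_strong(r, &expected, 0);
}

/* Imitates the destroy step: -EBUSY if anyone else holds a users reference. */
static int model_try_destroy(struct model_obj *obj)
{
	if (!model_dec_if_one(&obj->users))
		return -EBUSY;
	return 0;
}

/* Case A: the remover first took a users reference, as in the quoted patch. */
static int model_destroy_with_extra_users_ref(void)
{
	struct model_obj obj;

	atomic_init(&obj.users, 1);
	atomic_init(&obj.shortterm_users, 0);
	atomic_fetch_add(&obj.users, 1);	/* extra users reference */
	return model_try_destroy(&obj);		/* the racing destroy sees users == 2 */
}

/* Case B: the remover holds only a shortterm_users reference. */
static int model_destroy_with_shortterm_ref(void)
{
	struct model_obj obj;

	atomic_init(&obj.users, 1);
	atomic_init(&obj.shortterm_users, 0);
	atomic_fetch_add(&obj.shortterm_users, 1);
	return model_try_destroy(&obj);		/* users is still 1, destroy proceeds */
}
```

In this model case A returns -EBUSY and case B returns 0, which is why the
suggestion that follows has the caller take only a shortterm_users reference.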

You want somewhat different destroy logic:

/*
 * The caller must directly obtain a shortterm_users reference without a users
 * reference using its own locking to protect the pointer. This function always
 * puts back the shortterm_users reference.
 */
int iommufd_object_remove_tombstone(struct iommufd_ctx *ictx,
				    struct iommufd_object *to_destroy)
{
	XA_STATE(xas, &ictx->objects, to_destroy->id);
	struct iommufd_object *obj;
	int ret;

	xa_lock(&ictx->objects);
	obj = xas_load(&xas);
	if (xa_is_zero(obj) || obj == NULL) {
		/*
		 * Another thread is racing to destroy this, since we have the
		 * shortterm_users refcount the other thread has xa_unlocked()
		 * but not passed iommufd_object_dec_wait_shortterm().
		 */
		if (refcount_dec_and_test(&to_destroy->shortterm_users))
			wake_up_interruptible_all(&ictx->destroy_wait);
		ret = 0;
		goto err_xa;
	} else if (WARN_ON(obj != to_destroy)) {
		refcount_dec(&to_destroy->shortterm_users);
		ret = -ENOENT;
		goto err_xa;
	}

	/*
	 * The object is still in the xarray, so this thread will try to destroy
	 * it. Put back the caller's shortterm_users.
	 */
	refcount_dec(&obj->shortterm_users);

	if (!refcount_dec_if_one(&obj->users)) {
		ret = -EBUSY;
		goto err_xa;
	}

	/* Leave behind a tombstone to prevent re-use of this entry */
	xas_store(&xas, XA_ZERO_ENTRY);
	xa_unlock(&ictx->objects);

	/*
	 * Since users is zero any positive shortterm_users must be racing
	 * iommufd_put_object(), or we have a bug.
	 */
	ret = iommufd_object_dec_wait_shortterm(ictx, obj);
	if (WARN_ON(ret))
		return ret;

	iommufd_object_ops[obj->type].destroy(obj);
	kfree(obj);
	return 0;

err_xa:
	xa_unlock(&ictx->objects);

	/* The returned object reference count is zero */
	return ret;
}

Then you'd call it by doing something like:

static void iommufd_device_remove_vdev(struct iommufd_device *idev)
{
	struct iommufd_object *to_destroy = NULL;
	int ret;

	mutex_lock(&idev->igroup->lock);
	if (!idev->vdev) {
		mutex_unlock(&idev->igroup->lock);
		return;
	}
	if (refcount_inc_not_zero(&idev->vdev->obj.shortterm_users))
		to_destroy = &idev->vdev->obj;
	mutex_unlock(&idev->igroup->lock);

	if (to_destroy) {
		ret = iommufd_object_remove_tombstone(idev->ictx, to_destroy);
		if (WARN_ON(ret))
			return;
	}

	/*
	 * We don't know what thread is actually going to destroy the vdev, but
	 * once the vdev is destroyed the pointer is NULL'd. At this
	 * point idev->users is 0 so no other thread can set a new vdev.
	 */
	if (!wait_event_timeout(idev->ictx->destroy_wait,
				!READ_ONCE(idev->vdev),
				msecs_to_jiffies(60000)))
		pr_crit("Timed out waiting for iommufd vdevice removal\n");
}

Though there is a cleaner option here, you could do:

	mutex_lock(&idev->igroup->lock);
	if (idev->vdev)
		iommufd_vdevice_abort(&idev->vdev->obj);
	mutex_unlock(&idev->igroup->lock);

And make it safe to call abort twice, e.g. by setting dev to NULL and
checking for that. The first thread to get the igroup lock, either via
iommufd_vdevice_destroy() or via the above, will do the actual abort
synchronously without any wait_event_timeout. That seems better?

> +	/* vdev can't outlive idev, vdev->idev is always valid, need no refcnt */
> +	vdev->idev = idev;

So this means as soon as 'idev->vdev = NULL;' happens idev is an
invalid pointer. Need a WRITE_ONCE there.

I would rephrase the comment as:
 iommufd_device_destroy() waits until idev->vdev is NULL before
 freeing the idev, which only happens once the vdev has finished
 destruction. Thus we do not need refcounting on either idev->vdev or
 vdev->idev.

and group both assignments together.

>  	vdev->ictx = ucmd->ictx;
>  	vdev->id = virt_id;
>  	vdev->dev = idev->dev;
>  	get_device(idev->dev);
>  	vdev->viommu = viommu;
>  	refcount_inc(&viommu->obj.users);
> +	/* idev->vdev is protected by idev->igroup->lock, need no refcnt */
> +	idev->vdev = vdev;

This can be WRITE_ONCE too
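
As a small illustration of grouping the two assignments and pairing the
WRITE_ONCE stores with the READ_ONCE in the waiter, here is a user-space
sketch. MODEL_WRITE_ONCE and MODEL_READ_ONCE are simplified volatile-cast
stand-ins for the kernel macros (they rely on the GCC/Clang __typeof__
extension), and the model_* structs and functions are hypothetical.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's WRITE_ONCE()/READ_ONCE(). */
#define MODEL_WRITE_ONCE(x, val) \
	(*(volatile __typeof__(x) *)&(x) = (val))
#define MODEL_READ_ONCE(x) \
	(*(const volatile __typeof__(x) *)&(x))

struct model_vdev;

struct model_idev {
	struct model_vdev *vdev;
};

struct model_vdev {
	struct model_idev *idev;
};

/* Set both back-pointers together, as suggested for the allocation path. */
static void model_link(struct model_idev *idev, struct model_vdev *vdev)
{
	vdev->idev = idev;
	MODEL_WRITE_ONCE(idev->vdev, vdev);
}

/* Destruction side: after this store the waiter may free the idev. */
static void model_unlink(struct model_idev *idev)
{
	MODEL_WRITE_ONCE(idev->vdev, NULL);
}

/* The condition a waiter would poll, mirroring !READ_ONCE(idev->vdev). */
static int model_vdev_gone(struct model_idev *idev)
{
	return MODEL_READ_ONCE(idev->vdev) == NULL;
}

/* Single-threaded walk through link, check, unlink, check. */
static int model_link_unlink_ok(void)
{
	struct model_idev idev = { NULL };
	struct model_vdev vdev = { NULL };
	int ok = 1;

	model_link(&idev, &vdev);
	ok &= !model_vdev_gone(&idev);
	ok &= (vdev.idev == &idev);
	model_unlink(&idev);
	ok &= model_vdev_gone(&idev);
	return ok;
}
```

The sketch only shows the access pattern. It does not model the lifetime rule
itself, namely that nothing may touch the idev through vdev->idev once
idev->vdev has been cleared.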

Jason