Message-ID: <20250619090542.29974-1-lizhe.67@bytedance.com>
Date: Thu, 19 Jun 2025 17:05:42 +0800
From: lizhe.67@...edance.com
To: jgg@...pe.ca,
david@...hat.com
Cc: akpm@...ux-foundation.org,
alex.williamson@...hat.com,
kvm@...r.kernel.org,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
lizhe.67@...edance.com,
peterx@...hat.com
Subject: Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()

On Wed, 18 Jun 2025 10:23:50 -0300, jgg@...pe.ca wrote:
> On Wed, Jun 18, 2025 at 08:19:28PM +0800, lizhe.67@...edance.com wrote:
> > On Wed, 18 Jun 2025 08:56:22 -0300, jgg@...pe.ca wrote:
> >
> > > On Wed, Jun 18, 2025 at 01:52:37PM +0200, David Hildenbrand wrote:
> > >
> > > > I thought we also wanted to optimize out the
> > > > is_invalid_reserved_pfn() check for each subpage of a folio.
> >
> > Yes, that is an important aspect of our optimization.
> >
> > > VFIO keeps a tracking structure for the ranges, you can record there
> > > if a reserved PFN was ever placed into this range and skip the check
> > > entirely.
> > >
> > > It would be very rare for reserved PFNs and non-reserved PFNs to
> > > be mixed within the same range; userspace could cause this, but
> > > nothing should.
> >
> > Yes, but it seems we don't have a straightforward interface for
> > obtaining the reserved attribute of such a large range of pfns.
>
> vfio_unmap_unpin() has the struct vfio_dma; you'd store the
> indication there and pass it down.
>
> It already builds the longest run of physical contiguity here:
>
> 	for (len = PAGE_SIZE; iova + len < end; len += PAGE_SIZE) {
> 		next = iommu_iova_to_phys(domain->domain, iova + len);
> 		if (next != phys + len)
> 			break;
> 	}
>
> And we pass down a physically contiguous range to
> unmap_unpin_fast()/unmap_unpin_slow().
>
> The only thing you need to do is detect reserved PFNs in
> vfio_unmap_unpin(), keyed off an optimized flag in the dma, and
> break up the above loop if it crosses a reserved boundary.
>
> If you have a reserved range then just directly call iommu_unmap and
> forget about any page pinning.
>
> Then in the page pinning side you use the range version.
>
> Something very approximately like the below. But again, I would
> implore you to just use iommufd that is already much better here.

Thank you for your suggestion. We are also working on that, but it
is not something that can be completed in a short time. In the near
term, we still expect to use the type1 interface.

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 1136d7ac6b597e..097b97c67e3f0d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -738,12 +738,13 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>  	long unlocked = 0, locked = 0;
>  	long i;
>  
> +	/* The caller has already ensured the pfn range is not reserved */
> +	unpin_user_page_range_dirty_lock(pfn_to_page(pfn), npage,
> +					 dma->prot & IOMMU_WRITE);
>  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> -		if (put_pfn(pfn++, dma->prot)) {
>  			unlocked++;
>  			if (vfio_find_vpfn(dma, iova))
>  				locked++;
> -		}
>  	}
>  
>  	if (do_accounting)
> @@ -1082,6 +1083,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  	while (iova < end) {
>  		size_t unmapped, len;
>  		phys_addr_t phys, next;
> +		bool reserved = false;
>  
>  		phys = iommu_iova_to_phys(domain->domain, iova);
>  		if (WARN_ON(!phys)) {
> @@ -1089,6 +1091,9 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  			continue;
>  		}
>  
> +		if (dma->has_reserved)
> +			reserved = is_invalid_reserved_pfn(phys >> PAGE_SHIFT);
> +
>  		/*
>  		 * To optimize for fewer iommu_unmap() calls, each of which
>  		 * may require hardware cache flushing, try to find the
> @@ -1098,21 +1103,31 @@
>  			next = iommu_iova_to_phys(domain->domain, iova + len);
>  			if (next != phys + len)
>  				break;
> +			if (dma->has_reserved &&
> +			    reserved != is_invalid_reserved_pfn(next >> PAGE_SHIFT))
> +				break;
>  		}
>  
>  		/*
>  		 * First, try to use fast unmap/unpin. In case of failure,
>  		 * switch to slow unmap/unpin path.
>  		 */
> -		unmapped = unmap_unpin_fast(domain, dma, &iova, len, phys,
> -					    &unlocked, &unmapped_region_list,
> -					    &unmapped_region_cnt,
> -					    &iotlb_gather);
> -		if (!unmapped) {
> -			unmapped = unmap_unpin_slow(domain, dma, &iova, len,
> -						    phys, &unlocked);
> -			if (WARN_ON(!unmapped))
> -				break;
> +		if (reserved) {
> +			unmapped = iommu_unmap(domain->domain, iova, len);
> +			iova += unmapped;
> +		} else {
> +			unmapped = unmap_unpin_fast(domain, dma, &iova, len,
> +						    phys, &unlocked,
> +						    &unmapped_region_list,
> +						    &unmapped_region_cnt,
> +						    &iotlb_gather);
> +			if (!unmapped) {
> +				unmapped = unmap_unpin_slow(domain, dma, &iova,
> +							    len, phys,
> +							    &unlocked);
> +				if (WARN_ON(!unmapped))
> +					break;
> +			}
>  		}
>  	}

As I understand it, there are still some open issues with this
implementation. How do we obtain the value of dma->has_reserved
(setting it within vfio_pin_pages_remote() might be a good option),
and how do we ensure that the value stays valid from the time of
assignment until we perform the unpin operation? I've searched
through the code, and it appears there are places where
SetPageReserved() is called outside of the initialization phase.
Please correct me if I am wrong. A rough sketch of the pin side I
have in mind follows.
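
For example (a minimal sketch only; "has_reserved" is a hypothetical
new field in struct vfio_dma, and the exact spot inside
vfio_pin_pages_remote() where each pfn is available is my
assumption):

	/*
	 * Sketch: record at pin time whether this vfio_dma has ever
	 * seen a reserved pfn.  The flag is sticky -- set once, never
	 * cleared -- so vfio_unmap_unpin() can skip the per-pfn
	 * is_invalid_reserved_pfn() checks whenever it is false.
	 */
	if (!dma->has_reserved && is_invalid_reserved_pfn(pfn))
		dma->has_reserved = true;

But even with such a sticky flag, if SetPageReserved() can be called
after we have pinned the pages, has_reserved may no longer match the
actual reserved state of the pfns by the time we unpin, which is the
consistency problem described above.
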
Thanks,
Zhe