Message-ID: <20250516081732.06ef2230.alex.williamson@redhat.com>
Date: Fri, 16 May 2025 08:17:32 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: lizhe.67@...edance.com
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
muchun.song@...ux.dev, peterx@...hat.com
Subject: Re: [PATCH] vfio/type1: optimize vfio_pin_pages_remote() for
hugetlbfs folio
On Fri, 16 May 2025 16:16:16 +0800
lizhe.67@...edance.com wrote:
> On Thu, 15 May 2025 15:19:46 -0600, alex.williamson@...hat.com wrote:
>
> >> +/*
> >> + * Find a random vfio_pfn that belongs to the range
> >> + * [iova, iova + PAGE_SIZE * npage)
> >> + */
> >> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> >> +                                             dma_addr_t iova, unsigned long npage)
> >> +{
> >> +        struct vfio_pfn *vpfn;
> >> +        struct rb_node *node = dma->pfn_list.rb_node;
> >> +        dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> >> +
> >> +        while (node) {
> >> +                vpfn = rb_entry(node, struct vfio_pfn, node);
> >> +
> >> +                if (end_iova <= vpfn->iova)
> >> +                        node = node->rb_left;
> >> +                else if (iova > vpfn->iova)
> >> +                        node = node->rb_right;
> >> +                else
> >> +                        return vpfn;
> >> +        }
> >> +        return NULL;
> >> +}
> >
> >This essentially duplicates vfio_find_vpfn(), where the existing
> >function only finds a single page. The existing function should be
> >extended for this new use case and callers updated. Also the vfio_pfn
>
> How about implementing it in this way?
>
> static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> {
>         return vfio_find_vpfn_range(dma, iova, 1);
> }
>
> With this implementation, existing callers of vfio_find_vpfn() won't
> need to be modified.
Yes, that would minimize churn elsewhere.
> >is not "random", it's the first vfio_pfn overlapping the range.
>
> Yes, you are right. I will change the comment to "Find the first vfio_pfn
> overlapping the range [iova, iova + PAGE_SIZE * npage) in the rb tree" in v2.
>
> >> +
> >> static void vfio_link_pfn(struct vfio_dma *dma,
> >>                           struct vfio_pfn *new)
> >> {
> >> @@ -670,6 +694,31 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> >>                                 iova += (PAGE_SIZE * ret);
> >>                                 continue;
> >>                         }
> >> +
> >
> >Spurious new blank line.
> >
> >> +                }
> >
> >A new blank line here would be appreciated.
>
> Thank you. I will address this in v2.
>
> >> +                /* Handle hugetlbfs page */
> >> +                if (likely(!disable_hugepages) &&
> >
> >Isn't this already accounted for with npage = 1?
>
> Yes. This check is not necessary. I will remove it in v2.
>
> >> +                    folio_test_hugetlb(page_folio(batch->pages[batch->offset]))) {
> >
> >I don't follow how this guarantees the entire batch->size is
> >contiguous. Isn't it possible that a batch could contain multiple
> >hugetlb folios? Is the assumption here only true if folio_nr_pages()
> >(or specifically the pages remaining) is >= batch->capacity? What
>
> Yes.
>
> >happens if we try to map the last half of one 2MB hugetlb page and
> >first half of the non-physically-contiguous next page? Or what if the
> >hugetlb size is 64KB and the batch contains multiple folios that are
> >not physically contiguous?
>
> Sorry for my problematic implementation. I will fix it in v2.
>
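
One way to bound the claim to only the pages that are provably
contiguous is to clamp the run at the end of the current folio, along
these lines (an untested sketch with made-up names, not the actual v2
code):

/* Hypothetical helper, not in the tree */
static long vfio_batch_nr_contig(struct vfio_batch *batch)
{
        struct page *page = batch->pages[batch->offset];
        struct folio *folio = page_folio(page);

        /*
         * Pages within a single folio are physically contiguous, so the
         * longest run we can safely claim is bounded by the remainder
         * of this folio and by the pages still sitting in the batch.
         */
        return min_t(long, batch->size,
                     folio_nr_pages(folio) - folio_page_idx(folio, page));
}

This assumes the batch was filled linearly by a single
pin_user_pages_remote() call; the existing continuity check against
*pfn_base + pinned is still needed for anything beyond the batch.
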
> >> +                        if (pfn != *pfn_base + pinned)
> >> +                                goto out;
> >> +
> >> +                        if (!rsvd && !vfio_find_vpfn_range(dma, iova, batch->size)) {
> >> +                                if (!dma->lock_cap &&
> >> +                                    mm->locked_vm + lock_acct + batch->size > limit) {
> >> +                                        pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> >> +                                                __func__, limit << PAGE_SHIFT);
> >> +                                        ret = -ENOMEM;
> >> +                                        goto unpin_out;
> >> +                                }
> >> +                                pinned += batch->size;
> >> +                                npage -= batch->size;
> >> +                                vaddr += PAGE_SIZE * batch->size;
> >> +                                iova += PAGE_SIZE * batch->size;
> >> +                                lock_acct += batch->size;
> >> +                                batch->offset += batch->size;
> >> +                                batch->size = 0;
> >> +                                continue;
> >> +                        }
> >
> >There's a lot of duplication with the existing page-iterative loop. I
> >think they could be consolidated if we extract the number of known
> >contiguous pages based on the folio into a variable, default 1.
>
> Good idea! I will try to implement this optimization based on this idea
> in v2.
>
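
Roughly what I had in mind, reusing the hypothetical
vfio_batch_nr_contig() sketched above and slotted into the existing
while loop after the pfn continuity check (again only a sketch, not the
actual v2 code):

                /* Step size defaults to 1, grows only for a proven run */
                long nr_pages = 1;

                if (folio_test_hugetlb(page_folio(batch->pages[batch->offset])))
                        nr_pages = vfio_batch_nr_contig(batch);

                /*
                 * If any page of the run already has a vfio_pfn, fall
                 * back to single-page steps so the per-page accounting
                 * stays exact.
                 */
                if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
                        nr_pages = 1;

                if (!rsvd && !vfio_find_vpfn(dma, iova)) {
                        if (!dma->lock_cap &&
                            mm->locked_vm + lock_acct + nr_pages > limit) {
                                pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
                                        __func__, limit << PAGE_SHIFT);
                                ret = -ENOMEM;
                                goto unpin_out;
                        }
                        lock_acct += nr_pages;
                }

                pinned += nr_pages;
                npage -= nr_pages;
                vaddr += PAGE_SIZE * nr_pages;
                iova += PAGE_SIZE * nr_pages;
                batch->offset += nr_pages;
                batch->size -= nr_pages;

With nr_pages == 1 this degenerates to the current per-page path, so
the hugetlb case wouldn't need a separate copy of the accounting code.
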
> >Also, while this approach is an improvement, it leaves a lot on the
> >table in scenarios where folio_nr_pages() exceeds batch->capacity. For
> >example, we're at best incrementing 1GB hugetlb pages in 2MB increments.
> >We're also wasting a lot of cycles filling page pointers we mostly
> >don't use. Thanks,
>
> Yes, that would be the ideal case. However, we cannot know in advance
> whether the pages to be pinned belong to a hugetlbfs folio, and
> increasing batch->capacity would lead to a significant increase in
> memory usage. Additionally, pin_user_pages_remote() currently lacks a
> variant that could reduce the overhead of filling the page pointers for
> a hugetlbfs folio (if I understand correctly). I suggest proceeding
> with additional optimizations after further infrastructure improvements.
That's fair; it's a nice improvement within the existing interfaces
even if we choose to pursue something more aggressive in the future.
Please consider documenting improvements relative to 1GB hugetlb in the
next version also. Thanks,
Alex