Message-ID: <20250516081616.73039-1-lizhe.67@bytedance.com>
Date: Fri, 16 May 2025 16:16:16 +0800
From: lizhe.67@...edance.com
To: alex.williamson@...hat.com
Cc: kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	lizhe.67@...edance.com,
	muchun.song@...ux.dev,
	peterx@...hat.com
Subject: Re: [PATCH] vfio/type1: optimize vfio_pin_pages_remote() for hugetlbfs folio

On Thu, 15 May 2025 15:19:46 -0600, alex.williamson@...hat.com wrote:

>> +/*
>> + * Find a random vfio_pfn that belongs to the range
>> + * [iova, iova + PAGE_SIZE * npage)
>> + */
>> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
>> +		dma_addr_t iova, unsigned long npage)
>> +{
>> +	struct vfio_pfn *vpfn;
>> +	struct rb_node *node = dma->pfn_list.rb_node;
>> +	dma_addr_t end_iova = iova + PAGE_SIZE * npage;
>> +
>> +	while (node) {
>> +		vpfn = rb_entry(node, struct vfio_pfn, node);
>> +
>> +		if (end_iova <= vpfn->iova)
>> +			node = node->rb_left;
>> +		else if (iova > vpfn->iova)
>> +			node = node->rb_right;
>> +		else
>> +			return vpfn;
>> +	}
>> +	return NULL;
>> +}
>
>This essentially duplicates vfio_find_vpfn(), where the existing
>function only finds a single page.  The existing function should be
>extended for this new use case and callers updated.  Also the vfio_pfn

How about implementing it in this way?

static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
{
	return vfio_find_vpfn_range(dma, iova, 1);
}

With this implementation, existing callers of vfio_find_vpfn() won't need
to be modified.

>is not "random", it's the first vfio_pfn overlapping the range.

Yes, you are right. I will change the comment to "Find the first vfio_pfn
overlapping the range [iova, iova + PAGE_SIZE * npage) in the rb tree" in v2.

>> +
>>  static void vfio_link_pfn(struct vfio_dma *dma,
>>  			  struct vfio_pfn *new)
>>  {
>> @@ -670,6 +694,31 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>>  				iova += (PAGE_SIZE * ret);
>>  				continue;
>>  			}
>> +
>
>Spurious new blank line.
>
>> +		}
>
>A new blank line here would be appreciated.

Thank you. I will address this in v2.

>> +		/* Handle hugetlbfs page */
>> +		if (likely(!disable_hugepages) &&
>
>Isn't this already accounted for with npage = 1?

Yes. This check is not necessary. I will remove it in v2.

>> +				folio_test_hugetlb(page_folio(batch->pages[batch->offset]))) {
>
>I don't follow how this guarantees the entire batch->size is
>contiguous.  Isn't it possible that a batch could contain multiple
>hugetlb folios?  Is the assumption here only true if folio_nr_pages()
>(or specifically the pages remaining) is >= batch->capacity?  What

Yes.

>happens if we try to map the last half of one 2MB hugetlb page and
>first half of the non-physically-contiguous next page?  Or what if the
>hugetlb size is 64KB and the batch contains multiple folios that are
>not physically contiguous?

Sorry for my problematic implementation. I will fix it in v2.

>> +			if (pfn != *pfn_base + pinned)
>> +				goto out;
>> +
>> +			if (!rsvd && !vfio_find_vpfn_range(dma, iova, batch->size)) {
>> +				if (!dma->lock_cap &&
>> +				    mm->locked_vm + lock_acct + batch->size > limit) {
>> +					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
>> +						__func__, limit << PAGE_SHIFT);
>> +					ret = -ENOMEM;
>> +					goto unpin_out;
>> +				}
>> +				pinned += batch->size;
>> +				npage -= batch->size;
>> +				vaddr += PAGE_SIZE * batch->size;
>> +				iova += PAGE_SIZE * batch->size;
>> +				lock_acct += batch->size;
>> +				batch->offset += batch->size;
>> +				batch->size = 0;
>> +				continue;
>> +			}
>
>There's a lot of duplication with the existing page-iterative loop.  I
>think they could be consolidated if we extract the number of known
>contiguous pages based on the folio into a variable, default 1.

Good idea! I will try to implement this optimization in v2.

>Also, while this approach is an improvement, it leaves a lot on the
>table in scenarios where folio_nr_pages() exceeds batch->capacity.  For
>example we're at best incrementing 1GB hugetlb pages in 2MB increments.
>We're also wasting a lot of cycles to fill pages points we mostly don't
>use.  Thanks,

Yes, that would be the ideal case. However, we cannot know in advance
whether the pages to be pinned belong to a hugetlbfs folio, and increasing
batch->capacity would lead to a significant increase in memory usage.
Additionally, pin_user_pages_remote() currently lacks a variant that could
reduce the overhead of filling the pages array for a hugetlbfs folio
(though I may be mistaken). I suggest we proceed with additional
optimizations after further infrastructure improvements.
