Message-ID: <20250522082524.75076-1-lizhe.67@bytedance.com>
Date: Thu, 22 May 2025 16:25:24 +0800
From: lizhe.67@...edance.com
To: david@...hat.com
Cc: alex.williamson@...hat.com,
	kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	lizhe.67@...edance.com,
	muchun.song@...ux.dev,
	peterx@...hat.com
Subject: Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio

On Thu, 22 May 2025 09:22:50 +0200, david@...hat.com wrote:

>On 22.05.25 05:49, lizhe.67@...edance.com wrote:
>> On Wed, 21 May 2025 13:17:11 -0600, alex.williamson@...hat.com wrote:
>> 
>>>> From: Li Zhe <lizhe.67@...edance.com>
>>>>
>>>> When vfio_pin_pages_remote() is called with a range of addresses that
>>>> includes large folios, the function currently performs individual
>>>> statistics counting operations for each page. This can lead to significant
>>>> performance overhead, especially when dealing with large ranges of pages.
>>>>
>>>> This patch optimizes this process by batching the statistics counting
>>>> operations.
>>>>
>>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
>>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
>>>> address space has been mapped to physical memory using hugetlbfs with
>>>> pagesize=2M.
>>>>
>>>> Before this patch:
>>>> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
>>>>
>>>> After this patch:
>>>> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
>>>>
>>>> Signed-off-by: Li Zhe <lizhe.67@...edance.com>
>>>> Co-developed-by: Alex Williamson <alex.williamson@...hat.com>
>>>> Signed-off-by: Alex Williamson <alex.williamson@...hat.com>
>>>> ---
>>>
>>> Given the discussion on v3, this is currently a Nak.  Follow up in that
>>> thread if there are further ideas on how to salvage this.  Thanks,
>> 
>> How about the approach David mentioned, checking whether the pages or
>> PFNs are actually consecutive?
>> 
>> I have made a preliminary attempt, and performance testing shows the time
>> consumption is approximately 18,000 microseconds. Compared to the previous
>> 33,000 microseconds, this is also a significant improvement.
>> 
>> The modification is quite straightforward. The code below reflects the
>> changes I have made based on this patch.
>> 
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index bd46ed9361fe..1cc1f76d4020 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -627,6 +627,19 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>>          return ret;
>>   }
>>   
>> +static inline long continuous_page_num(struct vfio_batch *batch, long npage)
>> +{
>> +       long i;
>> +       unsigned long next_pfn = page_to_pfn(batch->pages[batch->offset]) + 1;
>> +
>> +       for (i = 1; i < npage; ++i) {
>> +               if (page_to_pfn(batch->pages[batch->offset + i]) != next_pfn)
>> +                       break;
>> +               next_pfn++;
>> +       }
>> +       return i;
>> +}
>
>
>What might be faster is obtaining the folio, then calculating the next
>expected page pointer and comparing whether the page pointers match.
>
>Essentially, using folio_page() to calculate the expected next page.
>
>nth_page() is simple pointer arithmetic with CONFIG_SPARSEMEM_VMEMMAP,
>so that might be rather fast.
>
>
>So we'd obtain
>
>start_idx = folio_idx(folio, batch->pages[batch->offset]);

Do you mean using folio_page_idx()?

>and then check for
>
>batch->pages[batch->offset + i] == folio_page(folio, start_idx + i)

Thank you for the suggestion. This is indeed a better solution.
The updated code might look like this:

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index bd46ed9361fe..f9a11b1d8433 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -627,6 +627,20 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
        return ret;
 }
 
+static inline long continuous_pages_num(struct folio *folio,
+               struct vfio_batch *batch, long npage)
+{
+       long i;
+       unsigned long start_idx =
+                       folio_page_idx(folio, batch->pages[batch->offset]);
+
+       for (i = 1; i < npage; ++i)
+               if (batch->pages[batch->offset + i] !=
+                               folio_page(folio, start_idx + i))
+                       break;
+       return i;
+}
+
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -708,8 +722,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
                         */
                        nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
                                                folio_page_idx(folio, batch->pages[batch->offset]));
-                       if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
-                               nr_pages = 1;
+                       if (nr_pages > 1) {
+                               if (vfio_find_vpfn_range(dma, iova, nr_pages))
+                                       nr_pages = 1;
+                               else
+                                       nr_pages = continuous_pages_num(folio, batch, nr_pages);
+                       }
 
                        /*
                         * Reserved pages aren't counted against the user,
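
For reference (and as you pointed out), with CONFIG_SPARSEMEM_VMEMMAP these
helpers are plain pointer arithmetic. A sketch, based on my reading of the
current mainline headers and not part of the patch:

/*
 * Sketch only: relevant definitions when CONFIG_SPARSEMEM_VMEMMAP is set.
 */
/* include/linux/mm.h */
#define nth_page(page, n)		((page) + (n))
#define folio_page_idx(folio, p)	((p) - &(folio)->page)
/* include/linux/page-flags.h */
#define folio_page(folio, n)		nth_page(&(folio)->page, (n))

So, if I am reading this correctly, each iteration of the loop in
continuous_pages_num() is essentially a single pointer comparison, with no
page_to_pfn() conversion as in my previous attempt.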

Thanks,
Zhe
