[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250610031013.98556-1-lizhe.67@bytedance.com>
Date: Tue, 10 Jun 2025 11:10:13 +0800
From: lizhe.67@...edance.com
To: alex.williamson@...hat.com
Cc: kvm@...r.kernel.org,
linux-kernel@...r.kernel.org,
lizhe.67@...edance.com
Subject: Re: [RFC] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
On Thu, 5 Jun 2025 10:09:58 -0600, alex.williamson@...hat.com wrote:
> On Thu, 5 Jun 2025 20:49:23 +0800
> lizhe.67@...edance.com wrote:
>
> > From: Li Zhe <lizhe.67@...edance.com>
> >
> > This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
> > for large folios'[1].
> >
> > When vfio_unpin_pages_remote() is called with a range of addresses that
> > includes large folios, the function currently performs individual
> > put_pfn() operations for each page. This can lead to significant
> > performance overheads, especially when dealing with large ranges of pages.
> >
> > This patch optimize this process by batching the put_pfn() operations.
> >
> > The performance test results, based on v6.15, for completing the 8G VFIO
> > IOMMU DMA unmapping, obtained through trace-cmd, are as follows. In this
> > case, the 8G virtual address space has been separately mapped to small
> > folio and physical memory using hugetlbfs with pagesize=2M. For large
> > folio, we achieve an approximate 66% performance improvement. However,
> > for small folios, there is an approximate 11% performance degradation.
> >
> > Before this patch:
> >
> > hugetlbfs with pagesize=2M:
> > funcgraph_entry: # 94413.092 us | vfio_unmap_unpin();
> >
> > small folio:
> > funcgraph_entry: # 118273.331 us | vfio_unmap_unpin();
> >
> > After this patch:
> >
> > hugetlbfs with pagesize=2M:
> > funcgraph_entry: # 31260.124 us | vfio_unmap_unpin();
> >
> > small folio:
> > funcgraph_entry: # 131945.796 us | vfio_unmap_unpin();
>
> I was just playing with a unit test[1] to validate your previous patch
> and added this as well:
>
> Test options:
>
> vfio-pci-mem-dma-map <ssss:bb:dd.f> <size GB> [hugetlb path]
>
> I'm running it once with device and size for the madvise and populate
> tests, then again adding /dev/hugepages (1G) for the remaining test:
>
> Base:
> # vfio-pci-mem-dma-map 0000:0b:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.294 s (54.4 GB/s)
> VFIO UNMAP DMA in 0.175 s (91.3 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.726 s (22.0 GB/s)
> VFIO UNMAP DMA in 0.169 s (94.5 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.071 s (224.0 GB/s)
> VFIO UNMAP DMA in 0.103 s (156.0 GB/s)
>
> Map patch:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.296 s (54.0 GB/s)
> VFIO UNMAP DMA in 0.175 s (91.7 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.741 s (21.6 GB/s)
> VFIO UNMAP DMA in 0.184 s (86.7 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.010 s (1542.9 GB/s)
> VFIO UNMAP DMA in 0.109 s (146.1 GB/s)
>
> Map + Unmap patches:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.301 s (53.2 GB/s)
> VFIO UNMAP DMA in 0.236 s (67.8 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.735 s (21.8 GB/s)
> VFIO UNMAP DMA in 0.234 s (68.4 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.011 s (1434.7 GB/s)
> VFIO UNMAP DMA in 0.023 s (686.5 GB/s)
>
> So overall the map optimization shows a nice improvement in hugetlbfs
> mapping performance. I was hoping we'd see some improvement in THP,
> but that doesn't appear to be the case. Will folio_nr_pages() ever be
> more than 1 for THP? The degradation in non-hugetlbfs case is small,
> but notable.
After I made the following modifications to the unit test, the
performance test results met the expectations.
diff --git a/vfio-pci-mem-dma-map.c b/vfio-pci-mem-dma-map.c
index 6fd3e83..58ea363 100644
--- a/vfio-pci-mem-dma-map.c
+++ b/vfio-pci-mem-dma-map.c
@@ -40,7 +40,7 @@ int main(int argc, char **argv)
{
int container, device, ret, huge_fd = -1, pgsize = getpagesize(), i;
int prot = PROT_READ | PROT_WRITE;
- int flags = MAP_SHARED | MAP_ANONYMOUS;
+ int flags = MAP_PRIVATE | MAP_ANONYMOUS;
char mempath[PATH_MAX] = "";
unsigned long size_gb, j, map_total, unmap_total, start, elapsed;
float secs;
@@ -131,7 +131,7 @@ int main(int argc, char **argv)
start = now_nsec();
for (j = 0, ptr = map; j < size_gb << 30; j += pgsize)
- (void)ptr[j];
+ ptr[j] = 1;
elapsed = now_nsec() - start;
secs = (float)elapsed / NSEC_PER_SEC;
fprintf(stderr, "%d: mmap populated in %.3fs\n", i, secs);
It seems that we need to use MAP_PRIVATE in this unit test to utilize
THP, rather than MAP_SHARED. My understanding is that for MAP_SHARED,
we call the function shmem_zero_setup() to map anonymous pages with
"dev/zero." In the case of MAP_PRIVATE, we directly call the function
vma_set_anonymous() (as referenced in the function __mmap_new_vma()).
Since the vm_ops for "dev/zero" does not implement the (*huge_fault)()
callback, this effectively precludes the use of THP.
In addition, the expression (void)ptr[j] might be ignored by the
compiler. It seems like a better strategy to simply assign a value
to it.
After making this modification to the unit test, there is almost no
difference in performance between the THP scenario and the hugetlbfs
scenario.
Base(v6.15):
#./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.048 s (331.3 GB/s)
VFIO UNMAP DMA in 0.138 s (116.1 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.281 s (57.0 GB/s)
VFIO UNMAP DMA in 0.313 s (51.1 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.053 s (301.2 GB/s)
VFIO UNMAP DMA in 0.139 s (115.2 GB/s)
Map patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.028 s (581.7 GB/s)
VFIO UNMAP DMA in 0.138 s (115.5 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.288 s (55.5 GB/s)
VFIO UNMAP DMA in 0.308 s (52.0 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (496.5 GB/s)
VFIO UNMAP DMA in 0.140 s (114.4 GB/s)
> The unmap optimization shows a pretty substantial decline in the
> non-hugetlbfs cases. I don't think that can be overlooked. Thanks,
Yes, the performance in the MAP_POPULATE scenario will experience
a significant drop. I've recently come up with a better idea to
address this performance issue. I will send the v2 patch later.
Thanks,
Zhe
Powered by blists - more mailing lists