Message-ID: <20250610031013.98556-1-lizhe.67@bytedance.com>
Date: Tue, 10 Jun 2025 11:10:13 +0800
From: lizhe.67@...edance.com
To: alex.williamson@...hat.com
Cc: kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	lizhe.67@...edance.com
Subject: Re: [RFC] vfio/type1: optimize vfio_unpin_pages_remote() for large folio

On Thu, 5 Jun 2025 10:09:58 -0600, alex.williamson@...hat.com wrote:

> On Thu,  5 Jun 2025 20:49:23 +0800
> lizhe.67@...edance.com wrote:
> 
> > From: Li Zhe <lizhe.67@...edance.com>
> > 
> > This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
> > for large folios'[1].
> > 
> > When vfio_unpin_pages_remote() is called with a range of addresses that
> > includes large folios, the function currently performs individual
> > put_pfn() operations for each page. This can lead to significant
> > performance overheads, especially when dealing with large ranges of pages.
> > 
> > This patch optimizes this process by batching the put_pfn() operations.
> >
> > The performance test results for completing an 8G VFIO IOMMU DMA
> > unmapping, based on v6.15 and obtained through trace-cmd, are as
> > follows. In this test, the 8G virtual address space was mapped in two
> > ways: to small folios, and to physical memory backed by hugetlbfs with
> > pagesize=2M. For large folios we achieve an approximate 66% performance
> > improvement, while for small folios there is an approximate 11%
> > performance degradation.
> > 
> > Before this patch:
> > 
> >     hugetlbfs with pagesize=2M:
> >     funcgraph_entry:      # 94413.092 us |  vfio_unmap_unpin();
> > 
> >     small folio:
> >     funcgraph_entry:      # 118273.331 us |  vfio_unmap_unpin();
> > 
> > After this patch:
> > 
> >     hugetlbfs with pagesize=2M:
> >     funcgraph_entry:      # 31260.124 us |  vfio_unmap_unpin();
> > 
> >     small folio:
> >     funcgraph_entry:      # 131945.796 us |  vfio_unmap_unpin();
> 
> I was just playing with a unit test[1] to validate your previous patch
> and added this as well:
> 
> Test options:
> 
> 	vfio-pci-mem-dma-map <ssss:bb:dd.f> <size GB> [hugetlb path]
> 
> I'm running it once with device and size for the madvise and populate
> tests, then again adding /dev/hugepages (1G) for the remaining test:
> 
> Base:
> # vfio-pci-mem-dma-map 0000:0b:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.294 s (54.4 GB/s)
> VFIO UNMAP DMA in 0.175 s (91.3 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.726 s (22.0 GB/s)
> VFIO UNMAP DMA in 0.169 s (94.5 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.071 s (224.0 GB/s)
> VFIO UNMAP DMA in 0.103 s (156.0 GB/s)
> 
> Map patch:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.296 s (54.0 GB/s)
> VFIO UNMAP DMA in 0.175 s (91.7 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.741 s (21.6 GB/s)
> VFIO UNMAP DMA in 0.184 s (86.7 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.010 s (1542.9 GB/s)
> VFIO UNMAP DMA in 0.109 s (146.1 GB/s)
> 
> Map + Unmap patches:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.301 s (53.2 GB/s)
> VFIO UNMAP DMA in 0.236 s (67.8 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.735 s (21.8 GB/s)
> VFIO UNMAP DMA in 0.234 s (68.4 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.011 s (1434.7 GB/s)
> VFIO UNMAP DMA in 0.023 s (686.5 GB/s)
> 
> So overall the map optimization shows a nice improvement in hugetlbfs
> mapping performance.  I was hoping we'd see some improvement in THP,
> but that doesn't appear to be the case.  Will folio_nr_pages() ever be
> more than 1 for THP?  The degradation in the non-hugetlbfs case is
> small, but notable.

After I made the following modifications to the unit test, the
performance test results met my expectations.

diff --git a/vfio-pci-mem-dma-map.c b/vfio-pci-mem-dma-map.c
index 6fd3e83..58ea363 100644
--- a/vfio-pci-mem-dma-map.c
+++ b/vfio-pci-mem-dma-map.c
@@ -40,7 +40,7 @@ int main(int argc, char **argv)
 {
        int container, device, ret, huge_fd = -1, pgsize = getpagesize(), i;
        int prot = PROT_READ | PROT_WRITE;
-       int flags = MAP_SHARED | MAP_ANONYMOUS;
+       int flags = MAP_PRIVATE | MAP_ANONYMOUS;
        char mempath[PATH_MAX] = "";
        unsigned long size_gb, j, map_total, unmap_total, start, elapsed;
        float secs;
@@ -131,7 +131,7 @@ int main(int argc, char **argv)
        
                start = now_nsec();
                for (j = 0, ptr = map; j < size_gb << 30; j += pgsize)
-                       (void)ptr[j];
+                       ptr[j] = 1;
                elapsed = now_nsec() - start;
                secs = (float)elapsed / NSEC_PER_SEC;
                fprintf(stderr, "%d: mmap populated in %.3fs\n", i, secs);

It seems that we need to use MAP_PRIVATE in this unit test to utilize
THP, rather than MAP_SHARED. My understanding is that for MAP_SHARED,
we call shmem_zero_setup() to back the anonymous mapping with
"/dev/zero", whereas for MAP_PRIVATE we directly call
vma_set_anonymous() (see __mmap_new_vma()). Since the vm_ops for
"/dev/zero" does not implement the (*huge_fault)() callback, this
effectively precludes the use of THP.
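
To illustrate the point (this is only a sketch, not code from the unit
test; the helper name is made up), a mapping that is eligible for THP
would look roughly like this:

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: a private anonymous mapping goes through vma_set_anonymous()
 * and is eligible for THP, while MAP_SHARED would go through
 * shmem_zero_setup(), whose vm_ops lack (*huge_fault)().
 */
static void *alloc_thp_candidate(size_t size)
{
	size_t off;
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* Hint the fault path/khugepaged to use huge pages here. */
	madvise(p, size, MADV_HUGEPAGE);

	/* Write-fault each base page so memory is actually allocated. */
	for (off = 0; off < size; off += getpagesize())
		((char *)p)[off] = 1;

	return p;
}

Whether the mapping really ended up THP-backed can be confirmed by
checking the AnonHugePages counter in /proc/<pid>/smaps.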

In addition, the read in the expression (void)ptr[j] might be
optimized away by the compiler. Simply assigning a value to each page
seems like a better strategy.
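
For example (just a sketch reusing the test's variables, and assuming
ptr is a plain char pointer), either of the following keeps the
per-page touch from being dropped:

	volatile char *vptr = (volatile char *)ptr;

	for (j = 0; j < size_gb << 30; j += pgsize)
		(void)vptr[j];	/* volatile read: cannot be elided */

	/* or, as in the diff above: */
	for (j = 0; j < size_gb << 30; j += pgsize)
		ptr[j] = 1;	/* plain write, which also dirties the page */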

After making these modifications to the unit test, there is almost no
difference in performance between the THP scenario and the hugetlbfs
scenario.

Base (v6.15):
# ./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.048 s (331.3 GB/s)
VFIO UNMAP DMA in 0.138 s (116.1 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.281 s (57.0 GB/s)
VFIO UNMAP DMA in 0.313 s (51.1 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.053 s (301.2 GB/s)
VFIO UNMAP DMA in 0.139 s (115.2 GB/s)

Map patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.028 s (581.7 GB/s)
VFIO UNMAP DMA in 0.138 s (115.5 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.288 s (55.5 GB/s)
VFIO UNMAP DMA in 0.308 s (52.0 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (496.5 GB/s)
VFIO UNMAP DMA in 0.140 s (114.4 GB/s)

> The unmap optimization shows a pretty substantial decline in the
> non-hugetlbfs cases.  I don't think that can be overlooked.  Thanks,

Yes, the performance in the MAP_POPULATE scenario does experience a
significant drop. I've recently come up with a better idea to address
this performance issue and will send a v2 patch later.

Thanks,
Zhe
