Message-ID: <20250605100958.10c885d3.alex.williamson@redhat.com>
Date: Thu, 5 Jun 2025 10:09:58 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: lizhe.67@...edance.com
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC] vfio/type1: optimize vfio_unpin_pages_remote() for large
folio
On Thu, 5 Jun 2025 20:49:23 +0800
lizhe.67@...edance.com wrote:
> From: Li Zhe <lizhe.67@...edance.com>
>
> This patch is based on the patch 'vfio/type1: optimize vfio_pin_pages_remote()
> for large folios'[1].
>
> When vfio_unpin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs an individual
> put_pfn() operation for each page. This can lead to significant
> performance overhead, especially when dealing with large ranges of pages.
>
> This patch optimizes this process by batching the put_pfn() operations.
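
Just to check that I'm reading the batching right: conceptually something
like the below, where a physically contiguous run of pinned pfns is
released with one call rather than one put_pfn() per page?  Rough sketch
on my end (invented helper name, not your actual patch):

static void put_pfns_batch(unsigned long pfn, unsigned long npages, int prot)
{
	struct page *page;

	/* only release pfns that are backed by a struct page */
	if (!pfn_valid(pfn))
		return;

	page = pfn_to_page(pfn);
	/* one call covers the whole physically contiguous run */
	unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
}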
>
> The performance test results for completing the 8G VFIO IOMMU DMA
> unmapping, based on v6.15 and obtained through trace-cmd, are as follows.
> In this case, the 8G virtual address space was mapped in two separate
> configurations: backed by small folios, and backed by hugetlbfs with
> pagesize=2M. For large folios, we achieve an approximate 66% performance
> improvement. However, for small folios, there is an approximate 11%
> performance degradation.
>
> Before this patch:
>
> hugetlbfs with pagesize=2M:
> funcgraph_entry: # 94413.092 us | vfio_unmap_unpin();
>
> small folio:
> funcgraph_entry: # 118273.331 us | vfio_unmap_unpin();
>
> After this patch:
>
> hugetlbfs with pagesize=2M:
> funcgraph_entry: # 31260.124 us | vfio_unmap_unpin();
>
> small folio:
> funcgraph_entry: # 131945.796 us | vfio_unmap_unpin();
I was just playing with a unit test[1] to validate your previous patch
and added this as well:
Test options:
vfio-pci-mem-dma-map <ssss:bb:dd.f> <size GB> [hugetlb path]
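FWIW, the part being timed boils down to the type1 map/unmap ioctls,
roughly like this ('container' is the VFIO container fd, 'buf'/'size' the
mmap'd test buffer; simplified sketch with iova/flags illustrative, not
the exact test code):

#include <sys/ioctl.h>
#include <linux/vfio.h>

static void map_unmap_once(int container, void *buf, unsigned long size)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (unsigned long)buf,
		.iova = 0,
		.size = size,
	};
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova = 0,
		.size = size,
	};

	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);	/* "MAP DMA" time */
	ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);	/* "UNMAP DMA" time */
}
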
I'm running it once with device and size for the madvise and populate
tests, then again adding /dev/hugepages (1G) for the remaining test:
Base:
# vfio-pci-mem-dma-map 0000:0b:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.294 s (54.4 GB/s)
VFIO UNMAP DMA in 0.175 s (91.3 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.726 s (22.0 GB/s)
VFIO UNMAP DMA in 0.169 s (94.5 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.071 s (224.0 GB/s)
VFIO UNMAP DMA in 0.103 s (156.0 GB/s)
Map patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.296 s (54.0 GB/s)
VFIO UNMAP DMA in 0.175 s (91.7 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.741 s (21.6 GB/s)
VFIO UNMAP DMA in 0.184 s (86.7 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.010 s (1542.9 GB/s)
VFIO UNMAP DMA in 0.109 s (146.1 GB/s)
Map + Unmap patches:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.301 s (53.2 GB/s)
VFIO UNMAP DMA in 0.236 s (67.8 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.735 s (21.8 GB/s)
VFIO UNMAP DMA in 0.234 s (68.4 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.011 s (1434.7 GB/s)
VFIO UNMAP DMA in 0.023 s (686.5 GB/s)
So overall the map optimization shows a nice improvement in hugetlbfs
mapping performance. I was hoping we'd see some improvement in THP,
but that doesn't appear to be the case. Will folio_nr_pages() ever be
more than 1 for THP? The degradation in the non-hugetlbfs cases is small,
but notable.
The unmap optimization shows a pretty substantial decline in the
non-hugetlbfs cases. I don't think that can be overlooked. Thanks,
Alex
[1] https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c