Message-ID: <20250605100958.10c885d3.alex.williamson@redhat.com>
Date: Thu, 5 Jun 2025 10:09:58 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: lizhe.67@...edance.com
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC] vfio/type1: optimize vfio_unpin_pages_remote() for large
folio
On Thu, 5 Jun 2025 20:49:23 +0800
lizhe.67@...edance.com wrote:
> From: Li Zhe <lizhe.67@...edance.com>
>
> This patch is based on the patch 'vfio/type1: optimize vfio_pin_pages_remote()
> for large folios'[1].
>
> When vfio_unpin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs an individual
> put_pfn() operation for each page. This can lead to significant
> performance overhead, especially when dealing with large ranges of pages.
>
> This patch optimizes this process by batching the put_pfn() operations.
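
Just to check that I'm reading the batching right: conceptually something
like the below, where a physically contiguous run of pinned pfns is
released with one call rather than one put_pfn() per page?  Rough sketch
on my end (invented helper name, not your actual patch):

static void put_pfns_batch(unsigned long pfn, unsigned long npages, int prot)
{
	struct page *page;

	/* only release pfns that are backed by a struct page */
	if (!pfn_valid(pfn))
		return;

	page = pfn_to_page(pfn);
	/* one call covers the whole physically contiguous run */
	unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
}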
>
> The performance test results for completing the 8G VFIO IOMMU DMA
> unmapping, based on v6.15 and obtained through trace-cmd, are as follows.
> In this case, the 8G virtual address space was mapped in two separate
> configurations: backed by small folios, and backed by hugetlbfs with
> pagesize=2M. For large folios, we achieve an approximate 66% performance
> improvement. However, for small folios, there is an approximate 11%
> performance degradation.
>
> Before this patch:
>
> hugetlbfs with pagesize=2M:
> funcgraph_entry: # 94413.092 us | vfio_unmap_unpin();
>
> small folio:
> funcgraph_entry: # 118273.331 us | vfio_unmap_unpin();
>
> After this patch:
>
> hugetlbfs with pagesize=2M:
> funcgraph_entry: # 31260.124 us | vfio_unmap_unpin();
>
> small folio:
> funcgraph_entry: # 131945.796 us | vfio_unmap_unpin();
I was just playing with a unit test[1] to validate your previous patch
and added this as well:
Test options:
vfio-pci-mem-dma-map <ssss:bb:dd.f> <size GB> [hugetlb path]
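FWIW, the part being timed boils down to the type1 map/unmap ioctls,
roughly like this ('container' is the VFIO container fd, 'buf'/'size' the
mmap'd test buffer; simplified sketch with iova/flags illustrative, not
the exact test code):

#include <sys/ioctl.h>
#include <linux/vfio.h>

static void map_unmap_once(int container, void *buf, unsigned long size)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (unsigned long)buf,
		.iova = 0,
		.size = size,
	};
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova = 0,
		.size = size,
	};

	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);	/* "MAP DMA" time */
	ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);	/* "UNMAP DMA" time */
}
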
I'm running it once with device and size for the madvise and populate
tests, then again adding /dev/hugepages (1G) for the remaining test:
Base:
# vfio-pci-mem-dma-map 0000:0b:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.294 s (54.4 GB/s)
VFIO UNMAP DMA in 0.175 s (91.3 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.726 s (22.0 GB/s)
VFIO UNMAP DMA in 0.169 s (94.5 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.071 s (224.0 GB/s)
VFIO UNMAP DMA in 0.103 s (156.0 GB/s)
Map patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.296 s (54.0 GB/s)
VFIO UNMAP DMA in 0.175 s (91.7 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.741 s (21.6 GB/s)
VFIO UNMAP DMA in 0.184 s (86.7 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.010 s (1542.9 GB/s)
VFIO UNMAP DMA in 0.109 s (146.1 GB/s)
Map + Unmap patches:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.301 s (53.2 GB/s)
VFIO UNMAP DMA in 0.236 s (67.8 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.735 s (21.8 GB/s)
VFIO UNMAP DMA in 0.234 s (68.4 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.011 s (1434.7 GB/s)
VFIO UNMAP DMA in 0.023 s (686.5 GB/s)
So overall the map optimization shows a nice improvement in hugetlbfs
mapping performance. I was hoping we'd see some improvement in THP,
but that doesn't appear to be the case. Will folio_nr_pages() ever be
more than 1 for THP? The degradation in the non-hugetlbfs cases is small,
but notable.
The unmap optimization shows a pretty substantial decline in the
non-hugetlbfs cases. I don't think that can be overlooked. Thanks,
Alex
[1] https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c