Message-ID: <aC3z_gUxJbY1_JP7@x1.local>
Date: Wed, 21 May 2025 11:40:46 -0400
From: Peter Xu <peterx@...hat.com>
To: David Hildenbrand <david@...hat.com>
Cc: Alex Williamson <alex.williamson@...hat.com>, lizhe.67@...edance.com,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
muchun.song@...ux.dev
Subject: Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge
folio
On Wed, May 21, 2025 at 08:35:47AM +0200, David Hildenbrand wrote:
> On 21.05.25 00:21, Alex Williamson wrote:
> > On Tue, 20 May 2025 19:38:45 +0200
> > David Hildenbrand <david@...hat.com> wrote:
> >
> > > On 20.05.25 09:00, lizhe.67@...edance.com wrote:
> > > > From: Li Zhe <lizhe.67@...edance.com>
> > >
> > > Subject: "huge folio" -> "large folios"
> > >
> > > >
> > > > When vfio_pin_pages_remote() is called with a range of addresses that
> > > > includes huge folios, the function currently performs individual
> > >
> > > Similar, we call it a "large" folio.
> > >
> > > > statistics counting operations for each page. This can lead to significant
> > > > performance overheads, especially when dealing with large ranges of pages.
> > > >
> > > > This patch optimizes this process by batching the statistics counting
> > > > operations.
> > > >
> > > > The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > > > obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > > > address space has been mapped to physical memory using hugetlbfs with
> > > > pagesize=2M.
> > > >
> > > > Before this patch:
> > > > funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
> > > >
> > > > After this patch:
> > > > funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
> > > >
> > > > Signed-off-by: Li Zhe <lizhe.67@...edance.com>
> > > > Signed-off-by: Alex Williamson <alex.williamson@...hat.com>
> > > > ---
> > > > Changelogs:
> > > >
> > > > v2->v3:
> > > > - Code simplification.
> > > > - Fix some issues in comments.
> > > >
> > > > v1->v2:
> > > > - Fix some issues in comments and formatting.
> > > > - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
> > > > - Move the processing logic for huge folio into the while(true) loop
> > > > and use a variable with a default value of 1 to indicate the number
> > > > of consecutive pages.
> > > >
> > > > v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
> > > > v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/
> > > >
> > > > drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
> > > > 1 file changed, 37 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index 0ac56072af9f..48f06ce0e290 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> > > > /*
> > > > * Helper Functions for host iova-pfn list
> > > > */
> > > > -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > > +
> > > > +/*
> > > > + * Find the first vfio_pfn that overlaps the range
> > > > + * [iova, iova + PAGE_SIZE * npage) in rb tree.
> > > > + */
> > > > +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> > > > + dma_addr_t iova, unsigned long npage)
> > > > {
> > > > struct vfio_pfn *vpfn;
> > > > struct rb_node *node = dma->pfn_list.rb_node;
> > > > + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> > > > while (node) {
> > > > vpfn = rb_entry(node, struct vfio_pfn, node);
> > > > - if (iova < vpfn->iova)
> > > > + if (end_iova <= vpfn->iova)
> > > > node = node->rb_left;
> > > > else if (iova > vpfn->iova)
> > > > node = node->rb_right;
> > > > @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > > return NULL;
> > > > }
> > > > +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > > +{
> > > > + return vfio_find_vpfn_range(dma, iova, 1);
> > > > +}
> > > > +
> > > > static void vfio_link_pfn(struct vfio_dma *dma,
> > > > struct vfio_pfn *new)
> > > > {
> > > > @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > > > * and rsvd here, and therefore continues to use the batch.
> > > > */
> > > > while (true) {
> > > > + struct folio *folio = page_folio(batch->pages[batch->offset]);
> > > > + long nr_pages;
> > > > +
> > > > if (pfn != *pfn_base + pinned ||
> > > > rsvd != is_invalid_reserved_pfn(pfn))
> > > > goto out;
> > > > + /*
> > > > + * Note: The current nr_pages does not achieve the optimal
> > > > + * performance in scenarios where folio_nr_pages() exceeds
> > > > + * batch->capacity. It is anticipated that future enhancements
> > > > + * will address this limitation.
> > > > + */
> > > > + nr_pages = min((long)batch->size, folio_nr_pages(folio) -
> > > > + folio_page_idx(folio, batch->pages[batch->offset]));
> > > > + if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> > > > + nr_pages = 1;
> > >
> > >
> > > You seem to assume that the batch really contains the consecutive pages
> > > of that folio.
> >
> > I don't think we are. We're iterating through our batch of pages from
> > GUP to find consecutive pfns. We use the page to get the pfn, the
> > folio, and immediately above, the offset into the folio. batch->size is
> > the remaining length of the page array from GUP and batch->offset is our
> > current index into that array.
>
> Let me try again using an example below ....
>
> > > This is not the case if we obtained the pages through GUP and we have
> > >
> > > (a) A MAP_PRIVATE mapping
> > >
> > > (b) We span multiple different VMAs
> > >
> > >
> > > Are we sure we can rule out (a) and (b)?
> > >
> > > A more future-proof approach would be at least looking whether the
> > > pages/pfns are actually consecutive.
> >
> > The unmodified (pfn != *pfn_base + pinned) test is where we verify we
> > have the next consecutive pfn. Maybe I'm not catching the dependency
> > you're seeing on consecutive pages, I think there isn't one unless
> > we're somehow misusing folio_page_idx() to get the offset into the
> > folio.
>
> Assume our page tables look like this (case (a), a partially mapped large
> pagecache folio mixed with COW'ed anonymous folios):
>
> + page[0] of folio 0
> |              + COWed anonymous folio (folio 1)
> |              |    + page[4] of folio 0
> |              |    |
> v              v    v
> F0P0 F0P1 F0P2 F1P0 F0P4 F0P5 F0P6 F0P7
>
> If we GUP that range, we get exactly these pages, except that the PFNs are
> not consecutive, because F0P3 was replaced by another page. The large folio
> is partially mapped.
>
>
> Maybe I misunderstand that code, but wouldn't we just "jump" over F1P0
> because we assume the batch would contain F1P0, where it really contains
> F0P4?
Looks like a real issue (even if unlikely setup)..
Before a next-gen GUP, maybe we should stick with memfd_pin_folios();
that will require the mmap read lock to be taken when we see a hugetlb
folio, though, so it'll be a fast path to try to pin hugetlb-only VMAs.

VFIO will also need to check, in that fast path: (1) double check it's
a hugetlb VMA (in case the VMA layout changed after GUP and before the
mmap read lock was taken), (2) a VMA boundary check, making sure the
range is not out of bounds, and (3) stick with VM_SHARED only for now
(I don't think anyone who uses MAP_PRIVATE will also care about how
fast it pins..). Then 1G will also work there.
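Purely as an illustration of those three checks (the function name
vfio_hugetlb_vma_ok() is made up, and the actual integration with
memfd_pin_folios() would need more plumbing than this), the fast-path
re-validation under the mmap read lock could be along these lines:

static bool vfio_hugetlb_vma_ok(struct mm_struct *mm, unsigned long vaddr,
				unsigned long size)
{
	struct vm_area_struct *vma;
	bool ok = false;

	mmap_read_lock(mm);
	vma = find_vma(mm, vaddr);
	if (vma && vma->vm_start <= vaddr &&
	    is_vm_hugetlb_page(vma) &&		/* (1) still a hugetlb VMA */
	    vaddr + size <= vma->vm_end &&	/* (2) within VMA bounds   */
	    (vma->vm_flags & VM_SHARED))	/* (3) VM_SHARED only      */
		ok = true;
	mmap_read_unlock(mm);

	return ok;
}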
Thanks,
--
Peter Xu