Message-ID: <20250523085415.6f316c84.alex.williamson@redhat.com>
Date: Fri, 23 May 2025 08:54:15 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: lizhe.67@...edance.com
Cc: david@...hat.com, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
muchun.song@...ux.dev, peterx@...hat.com
Subject: Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for
large folio
On Fri, 23 May 2025 11:42:38 +0800
lizhe.67@...edance.com wrote:
> On Thu, 22 May 2025 14:52:07 -0600, alex.williamson@...hat.com wrote:
>
> > On Thu, 22 May 2025 16:25:24 +0800
> > lizhe.67@...edance.com wrote:
> >
> > > On Thu, 22 May 2025 09:22:50 +0200, david@...hat.com wrote:
> > >
> > > >On 22.05.25 05:49, lizhe.67@...edance.com wrote:
> > > >> On Wed, 21 May 2025 13:17:11 -0600, alex.williamson@...hat.com wrote:
> > > >>
> > > >>>> From: Li Zhe <lizhe.67@...edance.com>
> > > >>>>
> > > >>>> When vfio_pin_pages_remote() is called with a range of addresses that
> > > >>>> includes large folios, the function currently performs individual
> > > >>>> statistics counting operations for each page. This can lead to significant
> > > >>>> performance overheads, especially when dealing with large ranges of pages.
> > > >>>>
> > > > >>>> This patch optimizes this process by batching the statistics counting
> > > >>>> operations.
> > > >>>>
> > > >>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > > >>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > > >>>> address space has been mapped to physical memory using hugetlbfs with
> > > >>>> pagesize=2M.
> > > >>>>
> > > >>>> Before this patch:
> > > >>>> funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
> > > >>>>
> > > >>>> After this patch:
> > > >>>> funcgraph_entry: # 16071.378 us | vfio_pin_map_dma();
> > > >>>>
> > > >>>> Signed-off-by: Li Zhe <lizhe.67@...edance.com>
> > > >>>> Co-developed-by: Alex Williamson <alex.williamson@...hat.com>
> > > >>>> Signed-off-by: Alex Williamson <alex.williamson@...hat.com>
> > > >>>> ---
> > > >>>
> > > >>> Given the discussion on v3, this is currently a Nak. Follow-up in that
> > > >>> thread if there are further ideas how to salvage this. Thanks,
> > > >>
> > > >> How about considering the solution David mentioned to check whether the
> > > >> pages or PFNs are actually consecutive?
> > > >>
> > > >> I have conducted a preliminary attempt, and the performance testing
> > > >> revealed that the time consumption is approximately 18,000 microseconds.
> > > >> Compared to the previous 33,000 microseconds, this also represents a
> > > >> significant improvement.
> > > >>
> > > >> The modification is quite straightforward. The code below reflects the
> > > >> changes I have made based on this patch.
> > > >>
> > > >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > >> index bd46ed9361fe..1cc1f76d4020 100644
> > > >> --- a/drivers/vfio/vfio_iommu_type1.c
> > > >> +++ b/drivers/vfio/vfio_iommu_type1.c
> > > >> @@ -627,6 +627,19 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> > > >> return ret;
> > > >> }
> > > >>
> > > >> +static inline long continuous_page_num(struct vfio_batch *batch, long npage)
> > > >> +{
> > > >> + long i;
> > > >> + unsigned long next_pfn = page_to_pfn(batch->pages[batch->offset]) + 1;
> > > >> +
> > > >> + for (i = 1; i < npage; ++i) {
> > > >> + if (page_to_pfn(batch->pages[batch->offset + i]) != next_pfn)
> > > >> + break;
> > > >> + next_pfn++;
> > > >> + }
> > > >> + return i;
> > > >> +}
> > > >
> > > >
> > > >What might be faster is obtaining the folio, and then calculating the
> > > >next expected page pointer, comparing whether the page pointers match.
> > > >
> > > >Essentially, using folio_page() to calculate the expected next page.
> > > >
> > > >nth_page() is a simple pointer arithmetic with CONFIG_SPARSEMEM_VMEMMAP,
> > > >so that might be rather fast.
> > > >
> > > >
> > > >So we'd obtain
> > > >
> > > >start_idx = folio_idx(folio, batch->pages[batch->offset]);
> > >
> > > Do you mean using folio_page_idx()?
> > >
> > > >and then check for
> > > >
> > > >batch->pages[batch->offset + i] == folio_page(folio, start_idx + i)
> > >
> > > Thank you for your reminder. This is indeed a better solution.
> > > The updated code might look like this:
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index bd46ed9361fe..f9a11b1d8433 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -627,6 +627,20 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> > > return ret;
> > > }
> > >
> > > +static inline long continuous_pages_num(struct folio *folio,
> > > + struct vfio_batch *batch, long npage)
> >
> > Note this becomes long enough that we should just let the compiler
> > decide whether to inline or not.
>
> Thank you! The 'inline' here indeed needs to be removed.
>
> > > +{
> > > + long i;
> > > + unsigned long start_idx =
> > > + folio_page_idx(folio, batch->pages[batch->offset]);
> > > +
> > > + for (i = 1; i < npage; ++i)
> > > + if (batch->pages[batch->offset + i] !=
> > > + folio_page(folio, start_idx + i))
> > > + break;
> > > + return i;
> > > +}
> > > +
> > > /*
> > > * Attempt to pin pages. We really don't want to track all the pfns and
> > > * the iommu can only map chunks of consecutive pfns anyway, so get the
> > > @@ -708,8 +722,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > > */
> > > nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
> > > folio_page_idx(folio, batch->pages[batch->offset]));
> > > - if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> > > - nr_pages = 1;
> > > + if (nr_pages > 1) {
> > > + if (vfio_find_vpfn_range(dma, iova, nr_pages))
> > > + nr_pages = 1;
> > > + else
> > > + nr_pages = continuous_pages_num(folio, batch, nr_pages);
> > > + }
> >
> >
> > I think we can refactor this a bit better and maybe if we're going to
> > the trouble of comparing pages we can be a bit more resilient to pages
> > already accounted as vpfns. I took a shot at it, compile tested only,
> > is there still a worthwhile gain?
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 0ac56072af9f..e8bba32148f7 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -319,7 +319,13 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> > /*
> > * Helper Functions for host iova-pfn list
> > */
> > -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > +
> > +/*
> > + * Find the first vfio_pfn that overlaps the range
> > + * [iova_start, iova_end) in rb tree.
> > + */
> > +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> > + dma_addr_t iova_start, dma_addr_t iova_end)
> > {
> > struct vfio_pfn *vpfn;
> > struct rb_node *node = dma->pfn_list.rb_node;
> > @@ -327,9 +333,9 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > while (node) {
> > vpfn = rb_entry(node, struct vfio_pfn, node);
> >
> > - if (iova < vpfn->iova)
> > + if (iova_end <= vpfn->iova)
> > node = node->rb_left;
> > - else if (iova > vpfn->iova)
> > + else if (iova_start > vpfn->iova)
> > node = node->rb_right;
> > else
> > return vpfn;
> > @@ -337,6 +343,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > return NULL;
> > }
> >
> > +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > +{
> > + return vfio_find_vpfn_range(dma, iova, iova + PAGE_SIZE);
> > +}
> > +
> > static void vfio_link_pfn(struct vfio_dma *dma,
> > struct vfio_pfn *new)
> > {
> > @@ -615,6 +626,43 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> > return ret;
> > }
> >
> > +static long contig_pages(struct vfio_dma *dma,
> > + struct vfio_batch *batch, dma_addr_t iova)
> > +{
> > + struct page *page = batch->pages[batch->offset];
> > + struct folio *folio = page_folio(page);
> > + long idx = folio_page_idx(folio, page);
> > + long max = min_t(long, batch->size, folio_nr_pages(folio) - idx);
> > + long nr_pages;
> > +
> > + for (nr_pages = 1; nr_pages < max; nr_pages++) {
> > + if (batch->pages[batch->offset + nr_pages] !=
> > + folio_page(folio, idx + nr_pages))
> > + break;
> > + }
> > +
> > + return nr_pages;
> > +}
> > +
> > +static long vpfn_pages(struct vfio_dma *dma,
> > + dma_addr_t iova_start, long nr_pages)
> > +{
> > + dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
> > + struct vfio_pfn *vpfn;
> > + long count = 0;
> > +
> > + do {
> > + vpfn = vfio_find_vpfn_range(dma, iova_start, iova_end);
>
> I am somewhat confused here. As written, vfio_find_vpfn_range() seems
> to walk the rbtree and return the node closest to the root that falls
> within the range [iova_start, iova_end), rather than the node closest
> to iova_start. Or perhaps I have misunderstood something?

Sorry, that's an oversight on my part. We might forego the _range
version and just do an inline walk of the tree counting the number of
already accounted pfns within the range. Thanks,

Alex

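A rough userspace sketch of such a counting walk follows. It is not
kernel code and is untested against the driver: struct node and
count_in_range() are hypothetical stand-ins for struct vfio_pfn and a
walk over dma->pfn_list, and a plain binary search tree replaces the
kernel rbtree. It only illustrates pruning subtrees that cannot hold an
iova inside [start, end).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vfio_pfn: one node per accounted page. */
struct node {
	unsigned long iova;
	struct node *left, *right;
};

/*
 * Count the nodes whose iova lies in [start, end).
 * Subtrees that lie entirely outside the range are skipped.
 */
static long count_in_range(const struct node *n,
			   unsigned long start, unsigned long end)
{
	long count = 0;

	assert(start <= end);

	while (n) {
		if (n->iova >= end) {
			/* Node and its right subtree are past the range. */
			n = n->left;
		} else if (n->iova < start) {
			/* Node and its left subtree are before the range. */
			n = n->right;
		} else {
			/* In range: matches may sit on both sides. */
			count += 1 + count_in_range(n->left, start, end);
			n = n->right;
		}
	}

	return count;
}
```

Unlike a lookup that returns the first match found from the root, this
visits every in-range node, so the result does not depend on the shape
of the tree.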
> > + if (likely(!vpfn))
> > + break;
> > +
> > + count++;
> > + iova_start = vpfn->iova + PAGE_SIZE;
> > + } while (iova_start < iova_end);
> > +
> > + return count;
> > +}
> > +
> > /*
> > * Attempt to pin pages. We really don't want to track all the pfns and
> > * the iommu can only map chunks of consecutive pfns anyway, so get the
> > @@ -681,32 +729,40 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > * and rsvd here, and therefore continues to use the batch.
> > */
> > while (true) {
> > + long nr_pages, acct_pages = 0;
> > +
> > if (pfn != *pfn_base + pinned ||
> > rsvd != is_invalid_reserved_pfn(pfn))
> > goto out;
> >
> > + nr_pages = contig_pages(dma, batch, iova);
> > + if (!rsvd) {
> > + acct_pages = nr_pages;
> > + acct_pages -= vpfn_pages(dma, iova, nr_pages);
> > + }
> > +
> > /*
> > * Reserved pages aren't counted against the user,
> > * externally pinned pages are already counted against
> > * the user.
> > */
> > - if (!rsvd && !vfio_find_vpfn(dma, iova)) {
> > + if (acct_pages) {
> > if (!dma->lock_cap &&
> > - mm->locked_vm + lock_acct + 1 > limit) {
> > + mm->locked_vm + lock_acct + acct_pages > limit) {
> > pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > __func__, limit << PAGE_SHIFT);
> > ret = -ENOMEM;
> > goto unpin_out;
> > }
> > - lock_acct++;
> > + lock_acct += acct_pages;
> > }
> >
> > - pinned++;
> > - npage--;
> > - vaddr += PAGE_SIZE;
> > - iova += PAGE_SIZE;
> > - batch->offset++;
> > - batch->size--;
> > + pinned += nr_pages;
> > + npage -= nr_pages;
> > + vaddr += PAGE_SIZE * nr_pages;
> > + iova += PAGE_SIZE * nr_pages;
> > + batch->offset += nr_pages;
> > + batch->size -= nr_pages;
> >
> > if (!batch->size)
> > break;
>
> Thanks,
> Zhe
>
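For reference, the page-pointer comparison discussed in the quoted text
can be illustrated outside the kernel. This is only a sketch under
stated assumptions: struct page here is a dummy type, contig_run() is a
hypothetical name, and "first + i" stands in for folio_page(folio,
idx + i), which is valid only when the page structs are laid out
contiguously (as with CONFIG_SPARSEMEM_VMEMMAP).

```c
#include <assert.h>
#include <stddef.h>

/* Dummy stand-in for the kernel's struct page. */
struct page {
	int unused;
};

/*
 * Return how many entries of pages[], starting at offset, point to
 * consecutive page structs. At most max entries are examined and the
 * result is always at least 1.
 */
static long contig_run(struct page **pages, long offset, long max)
{
	struct page *first = pages[offset];
	long i;

	assert(max >= 1);

	for (i = 1; i < max; i++) {
		/* Expected next page is a fixed stride from the first. */
		if (pages[offset + i] != first + i)
			break;
	}

	return i;
}
```

The caller is assumed to have already clamped max to both the batch
size and the number of pages remaining in the folio.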