Message-ID:
<SN6PR02MB415777A957EF622587C54A35D4E4A@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Fri, 3 Oct 2025 21:41:32 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
CC: "kys@...rosoft.com" <kys@...rosoft.com>, "haiyangz@...rosoft.com"
<haiyangz@...rosoft.com>, "wei.liu@...nel.org" <wei.liu@...nel.org>,
"decui@...rosoft.com" <decui@...rosoft.com>, "linux-hyperv@...r.kernel.org"
<linux-hyperv@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v3 3/5] Drivers: hv: Batch GPA unmap operations to improve
large region performance
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Friday, October 3, 2025 9:52 AM
>
> On Fri, Oct 03, 2025 at 12:27:13AM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Thursday, October 2, 2025 9:36 AM
> > >
> > > Reduce overhead when unmapping large memory regions by batching GPA unmap
> > > operations in 2MB-aligned chunks.
> > >
> > > Use a dedicated constant for batch size to improve code clarity and
> > > maintainability.
> > >
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
> > > ---
> > > drivers/hv/mshv_root.h | 2 ++
> > > drivers/hv/mshv_root_hv_call.c | 2 +-
> > > drivers/hv/mshv_root_main.c | 15 ++++++++++++---
> > > 3 files changed, 15 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > > index e3931b0f12693..97e64d5341b6e 100644
> > > --- a/drivers/hv/mshv_root.h
> > > +++ b/drivers/hv/mshv_root.h
> > > @@ -32,6 +32,8 @@ static_assert(HV_HYP_PAGE_SIZE == MSHV_HV_PAGE_SIZE);
> > >
> > > #define MSHV_PIN_PAGES_BATCH_SIZE (0x10000000ULL / HV_HYP_PAGE_SIZE)
> > >
> > > +#define MSHV_MAX_UNMAP_GPA_PAGES 512
> > > +
> > > struct mshv_vp {
> > > u32 vp_index;
> > > struct mshv_partition *vp_partition;
> > > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > > index c9c274f29c3c6..0696024ccfe31 100644
> > > --- a/drivers/hv/mshv_root_hv_call.c
> > > +++ b/drivers/hv/mshv_root_hv_call.c
> > > @@ -17,7 +17,7 @@
> > > /* Determined empirically */
> > > #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
> > > #define HV_MAP_GPA_DEPOSIT_PAGES 256
> > > -#define HV_UMAP_GPA_PAGES 512
> > > +#define HV_UMAP_GPA_PAGES MSHV_MAX_UNMAP_GPA_PAGES
> > >
> > > #define HV_PAGE_COUNT_2M_ALIGNED(pg_count) (!((pg_count) & (0x200 - 1)))
> > >
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index 7ef66c43e1515..8bf0b5af25802 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -1378,6 +1378,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
> > > static void mshv_partition_destroy_region(struct mshv_mem_region *region)
> > > {
> > > struct mshv_partition *partition = region->partition;
> > > + u64 page_offset;
> > > u32 unmap_flags = 0;
> > > int ret;
> > >
> > > @@ -1396,9 +1397,17 @@ static void mshv_partition_destroy_region(struct mshv_mem_region *region)
> > > if (region->flags.large_pages)
> > > unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > >
> > > - /* ignore unmap failures and continue as process may be exiting */
> > > - hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
> > > - region->nr_pages, unmap_flags);
> > > + for (page_offset = 0; page_offset < region->nr_pages; page_offset++) {
> > > + if (!region->pages[page_offset])
> > > + continue;
> > > +
> > > + hv_call_unmap_gpa_pages(partition->pt_id,
> > > + ALIGN(region->start_gfn + page_offset,
> > > + MSHV_MAX_UNMAP_GPA_PAGES),
> >
> > Seems like this should be ALIGN_DOWN() instead of ALIGN(). ALIGN() rounds
> > up to get the requested alignment, which could skip over some non-zero
> > entries in region->pages[].
> >
>
> Indeed, the goal is to unmap in 2MB chunks as it's the max payload that
> can be passed to the hypervisor.
> I'll replace it with ALIGN_DOWN in the next revision.
>
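
To make the difference concrete: with region->start_gfn equal to 0 and
page_offset equal to 4 (the same numbers as in my example further down),
the two macros give:

	ALIGN(4, MSHV_MAX_UNMAP_GPA_PAGES)	/* 512 -- rounds up */
	ALIGN_DOWN(4, MSHV_MAX_UNMAP_GPA_PAGES)	/* 0 -- rounds down */

With ALIGN(), the batch starts at gfn 512 and the non-zero entries at
offsets 4 thru 511 are never unmapped.
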
> > And I'm assuming the unmap hypercall is idempotent in that if the specified
> > partition PFN range includes some pages that aren't mapped, the hypercall
> > just skips them and keeps going without returning an error.
> >
>
> It might be the case, but it's not reliable.
> If the region size isn’t aligned to MSHV_MAX_UNMAP_GPA_PAGES (i.e., not
> aligned to 2M), the hypervisor can return an error for the trailing
> invalid (non-zero) PFNs.
> I’ll fix this in the next revision.
>
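
FWIW, one way to handle the trailing partial batch (untested sketch,
reusing the names from the patch, with "gfn" being the ALIGN_DOWN'ed
batch start) is to clamp the page count so the hypercall never extends
past the end of the region:

	count = min_t(u64, MSHV_MAX_UNMAP_GPA_PAGES,
		      region->start_gfn + region->nr_pages - gfn);
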
> > > + MSHV_MAX_UNMAP_GPA_PAGES, unmap_flags);
> > > +
> > > + page_offset += MSHV_MAX_UNMAP_GPA_PAGES - 1;
> >
> > This update to the page_offset doesn't take into account the effect of the
> > ALIGN (or ALIGN_DOWN) call. With a change to ALIGN_DOWN(), it may
> > increment too far and perhaps cause the "for" loop to be exited prematurely,
> > which would fail to unmap some of the pages.
> >
>
> I’m not sure I see the problem here. If we align the offset by
> MSHV_MAX_UNMAP_GPA_PAGES and unmap the same number of pages, then we
> should increment the offset by that very same number, shouldn’t we?

Here's an example showing the problem I see (assuming ALIGN_DOWN
instead of ALIGN):

1) For simplicity in the example, assume region->start_gfn is zero.
2) Entries 0 thru 3 (i.e., 4 entries) in region->pages[] are zero.
3) Entries 4 thru 515 (the next 512 entries) are non-zero.
4) Entries 516 thru 1023 (the next 508 entries) are zero.
5) Entries 1024 thru 1535 (the last 512 entries) are non-zero.

Upon entering the "for" loop for the first time, page_offset gets
incremented to 4 because of skipping entries 0 thru 3 that are zero.
On the next iteration where page_offset is 4, the hypercall is made,
passing 0 for the gfn (because of ALIGN_DOWN), with a count of
512, so entries 0 thru 511 are unmapped. Entries 0 thru 3 are valid
entries, and the fact that they aren't mapped is presumably ignored
by the hypercall, so everything works.

Then page_offset is incremented by 511, so it will be 515. Continuing
the "for" loop increments page_offset to 516. The zero entries 516
thru 1023 increment page_offset to 1024. Finally the hypercall is made
again covering entries 1024 thru 1535, none of which are zero.

But notice that entries 512 thru 515 (which are non-zero) got skipped.
That's because the first invocation of the hypercall covered only through
entry 511, while page_offset was incremented to 515. page_offset
should have been set to 511, since that's the last entry processed by
the first invocation of the hypercall.
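
A corrected version of the loop might look like this (untested sketch,
using ALIGN_DOWN and the clamped count suggested above; I'm assuming
region->start_gfn is 2 Mbyte aligned so the batch never starts below
the region):

	for (page_offset = 0; page_offset < region->nr_pages; page_offset++) {
		u64 gfn, count;

		if (!region->pages[page_offset])
			continue;

		gfn = ALIGN_DOWN(region->start_gfn + page_offset,
				 MSHV_MAX_UNMAP_GPA_PAGES);
		count = min_t(u64, MSHV_MAX_UNMAP_GPA_PAGES,
			      region->start_gfn + region->nr_pages - gfn);

		/* ignore unmap failures and continue as process may be exiting */
		hv_call_unmap_gpa_pages(partition->pt_id, gfn, count,
					unmap_flags);

		/*
		 * Set page_offset to the last entry the hypercall covered;
		 * the "for" loop increment moves to the entry after it.
		 */
		page_offset = gfn + count - region->start_gfn - 1;
	}
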
Michael