Message-ID:
<SN6PR02MB41575CDB3874DB0867FF9E8FD43EA@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Mon, 25 Aug 2025 21:01:28 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Mukesh R <mrathor@...ux.microsoft.com>, "kys@...rosoft.com"
<kys@...rosoft.com>, "haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
"wei.liu@...nel.org" <wei.liu@...nel.org>, "decui@...rosoft.com"
<decui@...rosoft.com>, "tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>, "bp@...en8.de" <bp@...en8.de>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "hpa@...or.com"
<hpa@...or.com>, "lpieralisi@...nel.org" <lpieralisi@...nel.org>,
"kw@...ux.com" <kw@...ux.com>, "manivannan.sadhasivam@...aro.org"
<manivannan.sadhasivam@...aro.org>, "robh@...nel.org" <robh@...nel.org>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>, "arnd@...db.de" <arnd@...db.de>
CC: "x86@...nel.org" <x86@...nel.org>, "linux-hyperv@...r.kernel.org"
<linux-hyperv@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "linux-pci@...r.kernel.org"
<linux-pci@...r.kernel.org>, "linux-arch@...r.kernel.org"
<linux-arch@...r.kernel.org>
Subject: RE: [PATCH v3 1/7] Drivers: hv: Introduce hv_hvcall_*() functions for
hypercall arguments
From: Mukesh R <mrathor@...ux.microsoft.com> Sent: Friday, August 22, 2025 7:25 PM
>
> On 8/21/25 19:10, Michael Kelley wrote:
> > From: Mukesh R <mrathor@...ux.microsoft.com> Sent: Thursday, August 21, 2025 1:50 PM
> >>
> >> On 8/21/25 12:24, Michael Kelley wrote:
> >>> From: Mukesh R <mrathor@...ux.microsoft.com> Sent: Wednesday, August 20, 2025 7:58 PM
> >>>>
> >>>> On 8/20/25 17:31, Mukesh R wrote:
> >>>>> On 4/15/25 11:07, mhkelley58@...il.com wrote:
> >>>>>> From: Michael Kelley <mhklinux@...look.com>
> >>>>>>
> >>>>>>
> >> <snip>
> >>>>>
> >>>>>
> >>>>> IMHO, this is an unnecessary change that just obfuscates code. With the
> >>>>> status quo one has the advantage of seeing exactly what is going on: one
> >>>>> can use the args any which way, change the batch size any which way, and
> >>>>> is thus flexible.
> >>>
> >>> I started this patch set in response to some errors in open coding the
> >>> use of hyperv_pcpu_input/output_arg, to see if helper functions could
> >>> regularize the usage and reduce the likelihood of future errors. Balancing
> >>> the pluses and minuses of the result, in my view the helper functions are
> >>> an improvement, though not overwhelmingly so. Others may see the
> >>> tradeoffs differently, and as such I would not go to the mat in arguing the
> >>> patches must be taken. But if we don't take them, we need to go back and
> >>> clean up minor errors and inconsistencies in the open coding at some
> >>> existing hypercall call sites.
> >>
> >> Yes, definitely. Assuming Nuno knows what issues you are referring to,
> >> I'll work with him to get them addressed asap. Thanks for noticing them.
> >> If Nuno is not aware, I'll ping you for more info.
> >>
> >>
> >>>>> With time these functions only get more complicated and error-prone. The
> >>>>> RAM savings are minimal, the helpers make analyzing crash dumps harder,
> >>>>> and in some cases, as in your patch 3/7, they disable interrupts
> >>>>> unnecessarily in the error case:
> >>>>>
> >>>>> - if (count > HV_MAX_MODIFY_GPA_REP_COUNT) {
> >>>>> - pr_err("Hyper-V: GPA count:%d exceeds supported:%lu\n", count,
> >>>>> - HV_MAX_MODIFY_GPA_REP_COUNT);
> >>>>> + local_irq_save(flags); <<<<<<<
> >>>>> ...
> >>>
> >>> FWIW, this error case is not disabled. It is checked a few lines further down as:
> >>
> >> I meant disabled interrupts. The check moves to after interrupts are
> >> disabled, so it runs "disabled" in traditional OS terminology :).
> >
> > Got it. But why is it a problem to make this check with interrupts disabled?
>
> You are creating disabling overhead where that overhead previously
> did not exist.
I'm not clear on what you mean by "disabling overhead". The existing code
does the following:
1) Validate that "count" is not too big, and return an error if it is.
2) Disable interrupts
3) Populate the per-cpu hypercall input arg
4) Make the hypercall
5) Re-enable interrupts
With the patch, steps 1 and 2 are done in a different order:
2) Disable interrupts
1) Validate that "count" is not too big. Re-enable interrupts and return an error if it is.
3) Populate the per-cpu hypercall input arg
4) Make the hypercall
5) Re-enable interrupts
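In rough code form the difference looks like this. I'm guessing at the
exact shape of the hv_setup_in()/batch_size usage from patch 3/7, so
treat the names and details below as an illustrative sketch, not the
literal patch code:

        /* Existing code: validate first, then disable interrupts */
        if (count > HV_MAX_MODIFY_GPA_REP_COUNT) {
                pr_err("Hyper-V: GPA count:%d exceeds supported:%lu\n",
                       count, HV_MAX_MODIFY_GPA_REP_COUNT);
                return -EINVAL;
        }
        local_irq_save(flags);
        /* ... populate the per-cpu input arg and make the hypercall ... */
        local_irq_restore(flags);

        /* With the patch: disable interrupts first, then validate */
        local_irq_save(flags);
        batch_size = hv_setup_in((void **)&input, sizeof(*input));
        if (count > batch_size) {
                pr_err("Hyper-V: GPA count:%d exceeds supported:%u\n",
                       count, batch_size);
                local_irq_restore(flags);
                return -EINVAL;
        }
        /* ... populate the input arg and make the hypercall ... */
        local_irq_restore(flags);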
Validating "count" with interrupts disabled is probably an additional
2 or 3 instructions executed with interrupts disabled, which is negligible
compared to the thousands (or more) of instructions the hypercall will
execute with interrupts disabled.
Or are you referring to something else as "disabling overhead"?
>
>
> > The check is just for robustness and should never be true since
> > hv_mark_gpa_visibility() is called from only one place that already ensures
> > the PFN count won't overflow a single page.
> >
> >>
> >>>
> >>> + if (count > batch_size) {
> >>> + pr_err("Hyper-V: GPA count:%d exceeds supported:%u\n", count,
> >>> + batch_size);
> >>>
> >>>>>
> >>>>> So, this is a nak from me. sorry.
> >>>>>
> >>>>
> >>>> Furthermore, this makes us lose the ability to permanently map
> >>>> input/output pages in the hypervisor. So, Wei, kindly undo.
> >>>>
> >>>
> >>> Could you elaborate on "lose the ability to permanently map
> >>> input/output pages in the hypervisor"? What specifically can't be
> >>> done and why?
> >>
> >> Input and output are always mapped at a fixed GPA/SPA to avoid the
> >> hypervisor having to map/unmap every time.
> >
> > OK. But how does this patch set impede doing a fixed mapping?
>
> The output address can vary depending on the hypercall, instead of
> always being at a fixed address:
>
> *(void **)output = space + offset; <<<<<<
Agreed. But since mappings from GPA to SPA are page-granular, having
such a fixed mapping means that there's a mapping for every byte in
the page containing the GPA to the corresponding byte in the SPA,
right? So even though the offset above may vary across hypercalls,
the output GPA still refers to the same page (since the offset is always
less than 4096), and that page has a fixed mapping. I would expect the
hypercall code in the hypervisor to look for an existing mapping based
on the output page, not the output address that includes the offset.
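In other words, I'd expect something along these lines on the
hypervisor side (purely hypothetical pseudocode; none of these names
come from actual hypervisor code):

        /*
         * Hypothetical: key the lookup off the page, not the full
         * address, so a non-zero offset still finds the fixed mapping.
         */
        u64 out_page = output_gpa & HV_HYP_PAGE_MASK;  /* drop the offset */
        mapping = find_existing_mapping(vp, out_page); /* illustrative */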
But I haven't looked at the hypervisor code. If the Hyper-V folks say
that a non-zero offset thwarts finding the existing mapping, what does
the hypervisor end up doing? Creating a second mapping wouldn't seem
to make sense. So I'm really curious about what's going on ....
Michael
>
> > Wouldn't that fixed mapping be done at the time the page or pages
> > are allocated, and then be transparent to hypercall call sites?
> >
> >>
> >>> <snip>
> >>>
> >>>>>
> >>>>>> +/*
> >>>>>> + * Allocate one page that is shared between input and output args, which is
> >>>>>> + * sufficient for all current hypercalls. If a future hypercall requires
> >>>>>
> >>>>> That is incorrect. We have iommu map hypercalls that will use up the
> >>>>> entire page for input. More are coming as we implement RAM withdrawal
> >>>>> from the hypervisor.
> >>>
> >>> At least some form of RAM withdrawal is already implemented upstream as
> >>> hv_call_withdraw_memory(). The hypercall has a very small input using the
> >>> hv_setup_in() helper, but the output list of PFNs must go to a separately
> >>> allocated page so it can be retained with interrupts enabled while
> >>> __free_page() is called. The use of this separate output page predates the
> >>> introduction of the hv_setup_in() helper.
> >>
> >> Yeah, I am talking about hypervisor memory that the loader gives it, and
> >> that it accumulates for VMs over its lifetime. We are opening the flood
> >> gates, so you will see lots of patches very soon.
> >>
> >>
> >>> For iommu map hypercalls, what do the input and output look like? Is the
> >>> paradigm different from the typical small fixed portion plus a variable size
> >>> array of values that are fed into a rep hypercall? Is there also a large amount
> >>> of output from the hypercall? Just curious if there's a case that's fundamentally
> >>> different from the current set of hypercalls.
> >>
> >> Patches are coming soon, but at a high level, the hypercall includes a
> >> list of SPAs that the hypervisor will map into the iommu. These lists can
> >> get large. We will explore better ways to pass them, perhaps multiple
> >> pages; not sure yet, but for now it's a single page.
> >
> > To be clear, if the iommu hypercall does not produce any output, this patch
> > set uses the entire per-cpu hypercall arg page for input. For example,
>
> Good
>
> > hv_mark_gpa_visibility() uses the entire page for input, which is mostly an
> > array of PFNs.
> >
> > Using multiple input pages is definitely a new paradigm, on both the
> > hypervisor and guest sides, and that will need additional infrastructure,
> > with or without this patch set.
>
> Right. But with this patch set, every hypercall is affected when the code
> is modified to support that, rather than just one. That means one must
> test every hypercall.
>
> > I'm just trying to understand where there are real technical blockers vs.
> > concern about the style and the encapsulation the helpers impose.
>
> Well, no technical blockers that can't be resolved, but the style and
> obfuscation that the helpers impose are a big concern.
>
> Thanks,
> -Mukesh