linux-kernel - Re: [PATCH 2/2] hyperv: Do not overlap the input and output hypercall areas in get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ff359a3e-a275-4146-8e99-a06fea69b7a9@linux.microsoft.com>
Date: Mon, 23 Dec 2024 12:30:09 -0800
From: Roman Kisel <romank@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>, Wei Liu <wei.liu@...nel.org>
Cc: "hpa@...or.com" <hpa@...or.com>, "kys@...rosoft.com" <kys@...rosoft.com>,
 "bp@...en8.de" <bp@...en8.de>,
 "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
 "decui@...rosoft.com" <decui@...rosoft.com>,
 "eahariha@...ux.microsoft.com" <eahariha@...ux.microsoft.com>,
 "haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
 "mingo@...hat.com" <mingo@...hat.com>,
 "nunodasneves@...ux.microsoft.com" <nunodasneves@...ux.microsoft.com>,
 "tglx@...utronix.de" <tglx@...utronix.de>,
 "tiala@...rosoft.com" <tiala@...rosoft.com>,
 "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "x86@...nel.org" <x86@...nel.org>, "apais@...rosoft.com"
 <apais@...rosoft.com>, "benhill@...rosoft.com" <benhill@...rosoft.com>,
 "ssengar@...rosoft.com" <ssengar@...rosoft.com>,
 "sunilmut@...rosoft.com" <sunilmut@...rosoft.com>,
 "vdso@...bites.dev" <vdso@...bites.dev>
Subject: Re: [PATCH 2/2] hyperv: Do not overlap the input and output hypercall
 areas in get_vtl(void)



On 12/20/2024 2:42 PM, Michael Kelley wrote:
> From: Roman Kisel <romank@...ux.microsoft.com> Sent: Friday, December 20, 2024 11:14 AM
>>

[...]

> 
> I will not object to your preferred path of allocating the
> hyperv_pcpu_output_arg when CONFIG_HYPERV_VTL_MODE is set. For
> me, it has been a worthwhile discussion to explore the alternatives. I do,
> however, want to get further clarification about the spec requirements
> and whether our current Linux code is meeting those requirements. More
> on that below.  And allow me to continue the discussion about a couple
> of points you raised.
> 
Thank you, Michael, for helping me explore the options! From where I
sit, we both seem to agree that an update is needed, and where our
opinions diverge is whether we need to spend another per-CPU page or
not. If at any point, you start believing that your "Nack" is the best
option going forward, I'll put this patch aside.

[...]

> I see three uses in current upstream code. And then there are close to
> a dozen more uses in Nuno's year-old patch set for /dev/mshv that hasn't
> been accepted upstream yet. Does that roughly match what you are referring
> to?
> 
While compiling my arguments and making sure they're worth the readers'
time, I did miss to provide references to the code. Yes, that patch set
might be a good illustration of the direction, thank you very much for
pointing that out!

Another reference might have been the kernel used with OpenHCL (6.6.x):
https://github.com/microsoft/OHCL-Linux-Kernel. Azure Boost runs that, 
and the kernel is delighting our customers with the rock-solid
foundation.

It contains lots of mshv code, and it is the code run in Azure as the
VTL2 kernel; we're working towards upstreaming, e.g. here :) It uses the
hypercall output page, and the performance folks beat the hell out of
the code, and found it being up to snuff. There were few findings about
memory consumption out of which I remember the perf folks tuning SWIOTLB
size and tweaking the kernel memory layout to save on padding and
alignment. As for the per-CPU data, nothing allowing to save that much,
perhaps would be nice to be able to compile out more. Yet, as I
understand, x86_64 systems might not always be envisioned as the
commonplace foundation for building embedded systems so while we're
looking to save few KiBs in hyperv, there's likely much more to save
elsewhere.

>> * dom0 and the VTL mode can share code quite naturally,
> 
> Yep.
> 
To me, this has been the ever-shining light in the dark maze of other
arguments. It provides a corner stone to building the generic well-
tested code, hence developers will rest assured when introducing changes
that the changes are robust even when not testing the myriad ways the
hyperv code is used in. That is, the validation wouldn't need all the
permutations of conditional statements and all #ifdef's that would be
needed when special-casing.

>> * Not using the output page cannot acccount for all cases permitted by
>>     the TLFS clause "don't overlap input and output, and don't cross page
>>     boundaries",
> 
> This is what I don’t understand.
> 
I meant that when implementing a generic case, not using the output page
prohibits from being able to receive a page of output when supplying a
page of input. I agree that we can special-case to save a bit of memory
as it appears we don't need that ability (4KiB of output in response to
4KiB input) right now. What I question though if that specialization is
needed at all since it will lead to more code, more parameters for the
functions, more comments and if statements therefore more validation
work and more cognitive load, and more maintenance costs.

>> * Not using the output page everywhere for consistency requires updating
>>     call sites of `hv_do_hypercall` and friends, i.e. every place where a
>>     hypercall is made needs to incorporate the offset/pointer at which the
>>     output should start, or perhaps do some shenanigans with macro's,
> 
> In my musing about getting rid of hyperv_pcpu_output_arg entirely, the
> call signature of hv_do_hypercall() and friends would remain unchanged.
> There would still be a virtual address passed as the output argument (which
> might be NULL, and is *not* required to be page aligned). The caller of
> hv_do_hypercall() must only ensure that the virtual address points to an
> output area that is large enough to contain the output without crossing a
> page boundary. The area must also not overlap with the input area. Given
> that we currently have only 3 uses of hyperv_pcpu_output_arg in upstream
> code, only those would need to be tweaked to split the hyperv_pcpu_input_arg
> page into an input and output area. All other callers of hv_do_hypercall()
> would be unchanged.
> 
Your approach appears being solid to me! Again, going to be beating on
my drum :) I believe here we are deciding not only on the small
optimization which is a clever way of saving a bit of memory, and it did
bring a satisfying "wow, nice" to my mind.

We are drafting the constrains for the future code as it appears;
without knowing better, perhaps it is the wisest not to constrain. As
the quote goes:

"... We should forget about small efficiencies, say about 97% of the
time: premature optimization is the root of all evil. Yet we should not
pass up our opportunities in that critical 3%. A good programmer ...
will be wise to look carefully at the critical code; but only after that
code has been identified. ..."


>> * Not using the output page leads to negligible savings,
>>
>> it is hard to see for me how to make not using the hypercall output page
>> look as a profitable enginnering endeavor, really comes off as dancing
>> to the perf scare/feature creep tune for peanuts.
> 
> There would be some simplification in hv_common_free(), hv_common_init(),
> and hv_common_cpu_init() if there was no hyperv_pcpu_output_arg.  :-)
> 
Agreed!

>>
>> In my opinion, we should use the hypercall output page for the VTL mode
>> as dom0 does to share code and reduce code churn. Had we used some
>> `hv_get_registers` in both instead of the specialized for no compelling
>> imo practical reason `get_vtl` (as it just gets a vCPU register, nothing
>> more, nothing less), this code path would've been better tested, and any
>> of this patching would not have been needed.
> 
> FWIW, the historical context is that at the time get_vtl() was implemented,
> all VP registers that Linux needed to access were available as synthetic
> MSRs on the x86 side. Going back to its origins 15 years ago, the Linux code
> for Hyper-V always just used the synthetic MSRs. get_vtl() was the first time
> that x86 code needed the HVCALL_GET_VP_REGISTERS hypercall because
> HV_REGISTER_VSM_VP_STATUS has no synthetic MSR equivalent. There still
> is no 'hv_get_registers' function in upstream code on the x86 side.
> hv_get_vpreg() exists only on the arm64 side where there are no synthetic
> MSRs or equivalent.


> 
> In hindsight, hv_get_vpreg() could have been added on the x86 side, and
> then get_vtl() implemented on top of that. And I made the get_vtl() situation
> worse by giving Tianyu bad advice about overlapping the output area with
> the input area. :-(
> 
> I once had a manager at Microsoft who said "He reserved the right
> to wake up smarter every day."  I've had to claim that right as well
> from time-to-time. :-)
> 
Michael, thank you very much for your efforts! At that time, no one
knew otherwise iiuc. You had the willpower to make things happen and the
vision to bring the solution over the finish line, and imo the Hyper-V
community just cannot possibly thank you enough for your incredible
work. It is my genuine hope that putting the Fixes tag didn't overshadow
that fact for anyone as these five letters don't tell any of the story,
most assuredly.

>>
>> I'd wait for few days and then would likely prefer to run with Wei's
>> permission to send this patch in v2 as-is unless some damning evidence
>> presents itself.
> 
> Again, I won't object to your taking that path.
> 
> But regarding compliance with the TLFS, take a look at the code for
> hv_pci_read_mmio() and hv_pci_write_mmio().  Do you think that code
> violates the spec?  If not, what would a violation look like in the context
> of this discussion?  If current code is in violation, then we should fix that.
> 
Appreciate your advice! I don't think they do. The code doesn't overlap 
input and output, and to my eye, the arguments are sized and aligned 
appropriately. That does not violate TLFS. I'd think the
`hv_pci_read_mmio` might've used the output page as it is supposed to be
used for output to stick to one pattern, much similar to the `get_vtl`
in question.

Now, it seems the functions run in the VTL0 only (but I'd leave that to
someone else to judge) where we currently don't allocate the output
page, and `hv_pci_read_mmio` is the only function doing that input page
split so special-casing might be justified, perhaps adding a comment
would seem appropriate. Unless there are prospects to merge that with
dom0, I'd leave the code be.

> Michael

-- 
Thank you,
Roman