linux-kernel - Re: [PATCH 2/2] hyperv: Do not overlap the input and output hypercall areas in get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d8c4613a-33b6-4aa6-a3ae-7c888ab2d727@linux.microsoft.com>
Date: Thu, 19 Dec 2024 15:39:10 -0800
From: Roman Kisel <romank@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>, Wei Liu <wei.liu@...nel.org>
Cc: "hpa@...or.com" <hpa@...or.com>, "kys@...rosoft.com" <kys@...rosoft.com>,
 "bp@...en8.de" <bp@...en8.de>,
 "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
 "decui@...rosoft.com" <decui@...rosoft.com>,
 "eahariha@...ux.microsoft.com" <eahariha@...ux.microsoft.com>,
 "haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
 "mingo@...hat.com" <mingo@...hat.com>,
 "nunodasneves@...ux.microsoft.com" <nunodasneves@...ux.microsoft.com>,
 "tglx@...utronix.de" <tglx@...utronix.de>,
 "tiala@...rosoft.com" <tiala@...rosoft.com>,
 "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "x86@...nel.org" <x86@...nel.org>, "apais@...rosoft.com"
 <apais@...rosoft.com>, "benhill@...rosoft.com" <benhill@...rosoft.com>,
 "ssengar@...rosoft.com" <ssengar@...rosoft.com>,
 "sunilmut@...rosoft.com" <sunilmut@...rosoft.com>,
 "vdso@...bites.dev" <vdso@...bites.dev>
Subject: Re: [PATCH 2/2] hyperv: Do not overlap the input and output hypercall
 areas in get_vtl(void)



On 12/19/2024 1:37 PM, Michael Kelley wrote:
> From: Roman Kisel <romank@...ux.microsoft.com> Sent: Thursday, December 19, 2024 10:19 AM

[...]

>>
>> There will surely be more hypercall usage in the VTL mode that return
>> data and require the output pages as we progress with upstreaming the
>> VTL patches. Enabling the hypercall output pages allows to fix the
>> function in question in a very natural way, making it possible to
>> replace with some future `hv_get_vp_register` that would work for both
>> dom0 and VTL mode just the same.
>>
>> All told, if you believe that we should make this patch a one-liner,
>> I'll do as you suggested.
>>
> 
> FWIW, Roman and I had this same discussion back in August. See [1].
> I'll add one new thought that wasn't part of the August discussion.
Michael, thank you very much for helping out in finding a better
solution!

> 
> To my knowledge, the hypercalls that *may* use a full page for input
> and/or a full page for output don't actually *require* a full page. The
> size of the input and output depends on how many "entries" the
> hypercall is specified to process, where "entries" could be registers,
> memory PFNs, or whatever. I would expect the code to invoke these
> hypercalls must already deal with the case where the requested number
> of entries causes the input or output size to exceed one page, so the
> code just iterates making multiple invocations of the hypercall with
> a "batch size" that fits in one page.
That is what I see in the code, too. The TLFS requires to use a page
worth of data maximum ("cannot overlap or cross page boundaries"), hence
the hypercall code shall chunk up the input data appropriately.

> 
> It would be perfectly reasonable to limit the batch size so that a
> "batch" of input or output fits in a half page instead of a full page,
> avoiding the need to allocate hyperv_pcpu_output_arg. Or if the
> input and output sizes are not equal, use whatever input vs. output
> partitioning of a single page make sense for that hypercall. The
> tradeoff, of course, is having to make the hypercall more times
> with smaller batches. But if the hypercall is not in a hot path, this
> might be a reasonable tradeoff to avoid the additional memory
> allocation. Or if the hypercall processing time per "entry" is high,
> the added overhead of more invocations with smaller batches is
> probably negligible compared to the time processing the entries.
The hypervisor yields control back to the guest if the hypervisor
spends more than ~a dozen 1e-6 sec in the hypercall processing, and
the processing isn't done yet. When yielding the control back, the
hypervisor doesn't advance the instruction pointer so the guest can
process interrupts on the vCPU (if the guest didn't mask them), and
get back to processing the hypercall. That helps the guest stay
responsive if the guest chose to send larger batches. For smaller
batches, have to consider the cost of invoking the hypercall as you
are pointing out. On the topic of saving CPU time, there are also fast
hypercalls that pass data in the CPU registers.

> 
> This scheme could also be used in the existing root partition code
> that is currently the only user of the hyperv_pcpu_output_arg.
> I could see a valid argument being made to drop
> hyperv_pcpu_output_arg entirely and just use smaller batches.
In my view, that is a reasonable idea to cut down on memory usage.
Here is what I get applying that general idea to this specific
situation (and sticking to 4KiB as the page size).

We are going to save 4KiB * N of memory where N is the number of cores
allocated to the root partition. Let us also introduce M that denotes
the amount of memory in KiB allocated to the root partition.

Given that the order of magnitude for N is 1 (log_10(N) ~= 1), and the
order of magnitude for M is 6..7, the savings (~=10^(N-M)=1e-5) look
rather meager to my eye. That might be a judgement call, I wouldn't
argue that.

What is worrisome is that the guest goes against the specification. The
specification decrees: the input and output areas for the hypercall
shall not cross page boundaries and shall not overlap.
Hence, the hypervisor is within its right to produce up to 4KiB of
output in response to up to 4KiBs of input, and we have:

```
sizeof(input) + sizeof(output) <= 2*sizeof(page)
```

But when the guest doesn't use the output page, we obviously have

```
sizeof(input) + sizeof(output) <= sizeof(page)
```
on the guest side so the contract is broken.

The hypervisor would need to know that the guest optimizes its memory
usage in this way, limiting what is allowed by the specification when
implementing any new hypercalls.

> 
> Or are there hypercalls where a smaller batch size doesn't work
> at all or is a bad tradeoff for performance reasons? I know I'm not
> familiar with *all* the hypercalls that might be used in root
> partition or VTL code. If there are such hypercalls, I would be curious
> to know about them.
Nothing that I could find in the specification. I wouldn't think that
justifies investing in creating specialized/special-cased functions
on the guest side without solid evidence that more code is needed. In
this particular case, one day, I'd love to replace `get_vtl` with
one generic function `hv_get_vp_registers` that works both for dom0 and
VTLs, plays by the book and does not require much/any explaining what is
going on inside it and why. I believe this will make maintenance easier.


> 
> Michael
> 
> [1] https://lore.kernel.org/linux-hyperv/SN6PR02MB415759676AEF931F030430FDD4BE2@SN6PR02MB4157.namprd02.prod.outlook.com/

-- 
Thank you,
Roman