Message-ID: <3a7e883a-440f-4bec-a592-ac3af4cb1677@intel.com>
Date: Fri, 13 Jun 2025 10:41:21 +0800
From: Xiaoyao Li <xiaoyao.li@...el.com>
To: Sean Christopherson <seanjc@...gle.com>, Kai Huang <kai.huang@...el.com>
Cc: Yan Y Zhao <yan.y.zhao@...el.com>,
Rick P Edgecombe <rick.p.edgecombe@...el.com>,
Kirill Shutemov <kirill.shutemov@...el.com>, Fan Du <fan.du@...el.com>,
Dave Hansen <dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
Zhiquan Li <zhiquan1.li@...el.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>,
"tabba@...gle.com" <tabba@...gle.com>,
"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Ira Weiny <ira.weiny@...el.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
Isaku Yamahata <isaku.yamahata@...el.com>,
"michael.roth@....com" <michael.roth@....com>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>,
Chao P Peng <chao.p.peng@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
Vishal Annapurve <vannapurve@...gle.com>, "jroedel@...e.de"
<jroedel@...e.de>, Jun Miao <jun.miao@...el.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On 6/11/2025 10:42 PM, Sean Christopherson wrote:
> On Tue, May 20, 2025, Kai Huang wrote:
>> On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
>>> On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
>>>> On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
>>>>>> On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
>>>>>> accept order, e.g., they always accept 4K, there could be *endless
>>>>>> EPT violation* if I understand your words correctly.
>>>>>>
>>>>>> Isn't this yet-another reason we should choose to return PG_LEVEL_4K
>>>>>> instead of 2M if no accept level is provided in the fault?
>>>>> As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
>>>>> TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
>>>>
>>>> TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
>>>> docs say the VMM needs to demote *if* the mapping is large and the accept size
>>>> is small.
>
> No thanks, fix the spec and the TDX Module. Punting an error to the VMM is
> inconsistent, convoluted, and inefficient.
>
> Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":
>
> S-EPT state           ACCEPT vs. Mapping Size   Behavior
> Leaf   SEPT_PRESENT   Smaller                   TDACCEPT_SIZE_MISMATCH
> Leaf   !SEPT_PRESENT  Smaller                   EPT Violation  <=========================|
> Leaf   DONT_CARE      Same                      Success                                  | => THESE TWO SHOULD MATCH!!!
> !Leaf  SEPT_FREE      Larger                    EPT Violation, BECAUSE THERE'S NO PAGE   |
> !Leaf  SEPT_FREE      Larger                    TDACCEPT_SIZE_MISMATCH  <================|
>
>
> If ACCEPT is "too small", an EPT violation occurs. But if ACCEPT is "too big",
> a TDACCEPT_SIZE_MISMATCH error occurs. That's asinine.
>
> The only reason that comes to mind for punting the "too small" case to the VMM
> is to try and keep the guest alive if the VMM is mapping more memory than has
> been enumerated to the guest. E.g. if the guest suspects the VMM is malicious
> or buggy. IMO, that's a terrible reason to push this much complexity into the
> host. It also risks godawful boot times, e.g. if the guest kernel is buggy and
> accepts everything at 4KiB granularity.
>
> The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
> action, not force the hypervisor to limp along in a degraded state. If the guest
> doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
> entire 2MiB/1GiB region is available, then the guest can either log a warning and
> "poison" the page(s), or terminate and refuse to boot.
>
> If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
> guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
> appropriate granularity, then IMO that's firmly a guest bug.

It might just be that the guest doesn't want to accept at a larger level,
rather than that it can't. See the use case below.

> If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> memory, then there should be an explicit TDCALL to request that the unwanted
> regions of memory be unmapped. Smushing everything into implicit behavior has
> obviously created a giant mess.

Isn't ACCEPT with a specific level already explicit? Note that ACCEPT is
not only for the case where the VMM has already mapped the page and the
guest only needs to accept it to make it available; it also covers the
case where the guest requests the VMM to map the page for a GPA (at a
specific level) and then accepts it.
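
To make the "explicit" part concrete, below is a rough standalone sketch
(a toy model, not kernel code): the desired level is carried in bits 2:0
of the TDG.MEM.PAGE.ACCEPT operand, as far as I understand it and as
Linux's try_accept_one() does, so an EPT violation taken while the page
is still unmapped already tells the VMM which level the guest asked for.
The tdcall itself is stubbed out here.

#include <stdint.h>
#include <stdio.h>

enum accept_level { LVL_4K = 0, LVL_2M = 1, LVL_1G = 2 };

/* Operand: GPA in the upper bits, requested level in bits 2:0. */
static uint64_t accept_operand(uint64_t gpa, enum accept_level lvl)
{
	return (gpa & ~0xfffULL) | (uint64_t)lvl;
}

/* Stand-in for the real TDCALL; always "succeeds" in this model. */
static int tdcall_accept(uint64_t rcx)
{
	printf("TDG.MEM.PAGE.ACCEPT gpa=0x%llx level=%llu\n",
	       (unsigned long long)(rcx & ~0xfffULL),
	       (unsigned long long)(rcx & 0x7));
	return 0;
}

int main(void)
{
	/* Guest explicitly asks for a 2MB accept of GPA 0x40000000. */
	return tdcall_accept(accept_operand(0x40000000ULL, LVL_2M));
}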

Even for the former case, it is understandable to behave differently in
the "too small" and "too big" cases. If the requested accept level is
"too small", the VMM can handle it by demoting the page to satisfy the
guest. But when the level is "too big", the VMM usually cannot map the
page at a higher level, so an EPT violation cannot help. I admit this
leads to the requirement that the VMM should always try to map the page
at the highest available level whenever the EPT violation is not caused
by an ACCEPT that carries a desired mapping level.
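
That policy can be sketched roughly as below (again a plain C toy model,
not actual KVM code; the struct, field and helper names are all made up
for illustration):

#include <stdbool.h>
#include <stdio.h>

enum level { LVL_4K = 1, LVL_2M = 2, LVL_1G = 3 };

struct fault {
	bool has_accept_level;          /* violation came from ACCEPT */
	enum level accept_level;        /* level the guest asked for */
	enum level max_backing_level;   /* what the backing memory allows */
};

static enum level pick_map_level(const struct fault *f)
{
	/* "Too small" is fine: map (or demote) down to the request. */
	if (f->has_accept_level && f->accept_level < f->max_backing_level)
		return f->accept_level;

	/*
	 * No hint, or a request we cannot exceed anyway: map as large as
	 * the backing allows.  A "too big" request cannot be fixed here;
	 * the guest has to retry with a smaller size.
	 */
	return f->max_backing_level;
}

int main(void)
{
	struct fault f = { true, LVL_4K, LVL_2M };

	printf("level: %d\n", pick_map_level(&f));   /* honours the 4K request */
	f.has_accept_level = false;
	printf("level: %d\n", pick_map_level(&f));   /* maps at 2M */
	return 0;
}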

As for the scenario, the one I can think of is a guest constantly
converting a 4KB page between private and shared, for testing purposes.
The guest knows that accepting the GPA at a higher level takes more
time, and that converting it back to shared then triggers DEMOTE and
costs even more time. So for better performance, the guest just calls
ACCEPT with a 4KB page. However, the VMM returns PAGE_SIZE_MISMATCH and
forces the guest to accept a bigger size. What a stupid VMM.

Anyway, I'm just expressing how I understand the current design, and I
think it's reasonable. I don't object to the idea of returning
TDACCEPT_SIZE_MISMATCH for the "too small" case, but it needs to be
guest opt-in, i.e., let the guest itself choose the behavior.