Message-ID: <aEubI/6HkEw/IkUr@yzhao56-desk.sh.intel.com>
Date: Fri, 13 Jun 2025 11:29:39 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Xiaoyao Li <xiaoyao.li@...el.com>
CC: Sean Christopherson <seanjc@...gle.com>, Kai Huang <kai.huang@...el.com>,
Rick P Edgecombe <rick.p.edgecombe@...el.com>, Kirill Shutemov
<kirill.shutemov@...el.com>, Fan Du <fan.du@...el.com>, Dave Hansen
<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>, Zhiquan Li
<zhiquan1.li@...el.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
"tabba@...gle.com" <tabba@...gle.com>, "quic_eberman@...cinc.com"
<quic_eberman@...cinc.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, Ira Weiny <ira.weiny@...el.com>,
"vbabka@...e.cz" <vbabka@...e.cz>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, Isaku Yamahata <isaku.yamahata@...el.com>,
"michael.roth@....com" <michael.roth@....com>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, Chao P Peng <chao.p.peng@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, Vishal Annapurve
<vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, Jun Miao
<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Fri, Jun 13, 2025 at 10:41:21AM +0800, Xiaoyao Li wrote:
> On 6/11/2025 10:42 PM, Sean Christopherson wrote:
> > On Tue, May 20, 2025, Kai Huang wrote:
> > > On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
> > > > On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> > > > > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > > > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
> > > > > > > accept order, e.g., they always accept 4K, there could be *endless
> > > > > > > EPT violation* if I understand your words correctly.
> > > > > > >
> > > > > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K
> > > > > > > instead of 2M if no accept level is provided in the fault?
> > > > > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > > > > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> > > > >
> > > > > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> > > > > docs say the VMM needs to demote *if* the mapping is large and the accept size
> > > > > is small.
> >
> > No thanks, fix the spec and the TDX Module. Punting an error to the VMM is
> > inconsistent, convoluted, and inefficient.
> >
> > Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":
> >
> >   S-EPT state          ACCEPT vs. Mapping Size   Behavior
> >   Leaf   SEPT_PRESENT  Smaller                   TDACCEPT_SIZE_MISMATCH
> >   Leaf   !SEPT_PRESENT Smaller                   EPT Violation <=========================|
> >   Leaf   DONT_CARE     Same                      Success                                 | => THESE TWO SHOULD MATCH!!!
> >   !Leaf  SEPT_FREE     Larger                    EPT Violation, BECAUSE THERE'S NO PAGE  |
> >   !Leaf  SEPT_FREE     Larger                    TDACCEPT_SIZE_MISMATCH <================|
> >
> >
> > If ACCEPT is "too small", an EPT violation occurs. But if ACCEPT is "too big",
> > a TDACCEPT_SIZE_MISMATCH error occurs. That's asinine.
> >
> > The only reason that comes to mind for punting the "too small" case to the VMM
> > is to try and keep the guest alive if the VMM is mapping more memory than has
> > been enumerated to the guest. E.g. if the guest suspects the VMM is malicious
> > or buggy. IMO, that's a terrible reason to push this much complexity into the
> > host. It also risks godawful boot times, e.g. if the guest kernel is buggy and
> > accepts everything at 4KiB granularity.
> >
> > The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
> > action, not force the hypervisor to limp along in a degraded state. If the guest
> > doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
> > entire 2MiB/1GiB region is available, then the guest can either log a warning and
> > "poison" the page(s), or terminate and refuse to boot.
> >
> > If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
> > guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
> > appropriate granularity, then IMO that's firmly a guest bug.
>
> It might just be guest doesn't want to accept a larger level instead of
> can't. Use case see below.
>
> > If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> > memory, then there should be an explicit TDCALL to request that the unwanted
> > regions of memory be unmapped. Smushing everything into implicit behavior has
> > obviously created a giant mess.
>
> Isn't the ACCEPT with a specific level explicit? Note that ACCEPT is not
> only for the case where the VMM has already mapped the page and the guest
> only needs to accept it to make it available; it also works for the case
> where the guest requests the VMM to map the page for a gpa (at a specific
> level) and then the guest accepts it.
>
> Even for the former case, it is understandable that the "too small" and
> "too big" cases behave differently. If the requested accept level is "too
> small", the VMM can handle it by demoting the page to satisfy the guest. But
> when the level is "too big", usually the VMM cannot map the page at a higher
> level, so the EPT violation cannot help. I admit that this leads to the
> requirement that the VMM should always try to map the page at the highest
> available level, if the EPT violation is not caused by an ACCEPT that
> carries a desired mapping level.
>
> As for the scenario, the one I can think of is: the guest constantly
> converts a 4KB page between private and shared, for testing purposes.
> The guest knows that accepting the gpa at a higher level takes more time,
> and converting it to shared triggers a DEMOTE and takes even more time. So,
> for better performance, the guest just calls ACCEPT with a 4KB page. However, the VMM
Hmm, the first ACCEPT at 4KB level already triggers a DEMOTE.
So, I don't see how ACCEPT at 4KB helps performance.
Suppose the VMM has mapped a page at 2MB:

  Scenario 1                           Effort
  (1) Guest ACCEPTs at 2MB             ACCEPT 2MB
  (2) converts a 4KB page to shared    DEMOTE
  (3) converts it back to private      ACCEPT 4KB

  Scenario 2                           Effort
  (1) Guest ACCEPTs at 4KB             DEMOTE, ACCEPT 4KB
  (2) converts a 4KB page to shared
  (3) converts it back to private      ACCEPT 4KB
In step (3) of Scenario 1, the VMM will not map the page at 2MB according to
the current implementation, because PROMOTE requires a uniform ACCEPT status
across all 512 4KB pages to succeed.
> returns PAGE_SIZE_MISMATCH and forces the guest to accept a bigger size.
> What a stupid VMM.
I agree with Sean that if the guest doesn't want to accept at a bigger size
for certain reasons (e.g. it thinks it's unsafe or considers it an attack),
invoking an explicit TDVMCALL may be a better approach.
> Anyway, I'm just expressing how I understand the current design, and I
> think it's reasonable. And I don't object to the idea of returning
> ACCEPT_SIZE_MISMATCH for the "too small" case, but it needs to be guest
> opt-in, i.e., let the guest itself choose the behavior.