linux-kernel - Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE

Message-ID: <aEmVa0YjUIRKvyNy@google.com>
Date: Wed, 11 Jun 2025 07:42:21 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Kai Huang <kai.huang@...el.com>
Cc: Yan Y Zhao <yan.y.zhao@...el.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>, 
	Kirill Shutemov <kirill.shutemov@...el.com>, Xiaoyao Li <xiaoyao.li@...el.com>, 
	Fan Du <fan.du@...el.com>, Dave Hansen <dave.hansen@...el.com>, 
	"david@...hat.com" <david@...hat.com>, Zhiquan Li <zhiquan1.li@...el.com>, 
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com" <tabba@...gle.com>, 
	"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Ira Weiny <ira.weiny@...el.com>, 
	"vbabka@...e.cz" <vbabka@...e.cz>, "pbonzini@...hat.com" <pbonzini@...hat.com>, 
	Isaku Yamahata <isaku.yamahata@...el.com>, "michael.roth@....com" <michael.roth@....com>, 
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, 
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, Chao P Peng <chao.p.peng@...el.com>, 
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, Vishal Annapurve <vannapurve@...gle.com>, 
	"jroedel@...e.de" <jroedel@...e.de>, Jun Miao <jun.miao@...el.com>, 
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE

On Tue, May 20, 2025, Kai Huang wrote:
> On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
> > On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> > > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
> > > > > accept order, e.g., they always accept 4K, there could be *endless
> > > > > EPT violation* if I understand your words correctly.
> > > > > 
> > > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K
> > > > > instead of 2M if no accept level is provided in the fault?
> > > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> > > 
> > > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> > > docs say the VMM needs to demote *if* the mapping is large and the accept size
> > > is small.

No thanks, fix the spec and the TDX Module.  Punting an error to the VMM is
inconsistent, convoluted, and inefficient.

Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":

  S-EPT state         ACCEPT vs. Mapping Size         Behavior
  Leaf SEPT_PRESENT   Smaller                         TDACCEPT_SIZE_MISMATCH
  Leaf !SEPT_PRESENT  Smaller                         EPT Violation <=========================|
  Leaf DONT_CARE      Same                            Success                                 | => THESE TWO SHOULD MATCH!!!
  !Leaf SEPT_FREE     Larger                          EPT Violation, BECAUSE THERE'S NO PAGE  |
  !Leaf SEPT_FREE     Larger                          TDACCEPT_SIZE_MISMATCH <================|

If ACCEPT is "too small", an EPT violation occurs.  But if ACCEPT is "too big",
a TDACCEPT_SIZE_MISMATCH error occurs.  That's asinine.
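
To make the asymmetry concrete, here is a minimal C sketch.  The enum and
function names are hypothetical, and it models only the two mismatch directions
summarized above, not the TDX Module's actual S-EPT walk.

```c
#include <assert.h>

/* Hypothetical names; levels are ordered 4K < 2M < 1G. */
enum level { LEVEL_4K = 1, LEVEL_2M, LEVEL_1G };
enum outcome { ACCEPT_SUCCESS, ACCEPT_EPT_VIOLATION, ACCEPT_SIZE_MISMATCH };

/* Behavior as summarized above: the two mismatch directions diverge. */
static enum outcome accept_as_specified(enum level mapping, enum level accept)
{
	assert(mapping >= LEVEL_4K && mapping <= LEVEL_1G);
	assert(accept >= LEVEL_4K && accept <= LEVEL_1G);

	if (accept < mapping)
		return ACCEPT_EPT_VIOLATION;	/* "too small": punted to the VMM */
	if (accept > mapping)
		return ACCEPT_SIZE_MISMATCH;	/* "too big": error to the guest */
	return ACCEPT_SUCCESS;
}

/* Symmetric alternative: any size mismatch is an error for the guest. */
static enum outcome accept_symmetric(enum level mapping, enum level accept)
{
	assert(mapping >= LEVEL_4K && mapping <= LEVEL_1G);
	assert(accept >= LEVEL_4K && accept <= LEVEL_1G);

	return accept == mapping ? ACCEPT_SUCCESS : ACCEPT_SIZE_MISMATCH;
}
```

With a 2MiB mapping, accept_as_specified() yields an EPT violation for a 4KiB
ACCEPT but a size mismatch for a 1GiB ACCEPT, whereas accept_symmetric() yields
a size mismatch for both.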

The only reason that comes to mind for punting the "too small" case to the VMM
is to try and keep the guest alive if the VMM is mapping more memory than has
been enumerated to the guest.  E.g. if the guest suspects the VMM is malicious
or buggy.  IMO, that's a terrible reason to push this much complexity into the
host.  It also risks godawful boot times, e.g. if the guest kernel is buggy and
accepts everything at 4KiB granularity.
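
As a rough illustration of the boot-time concern, the sketch below only counts
how many ACCEPT operations a range needs at each granularity.  The helper name
and the 16GiB figure are made up for the example, and it says nothing about the
cost of each operation or of any demotion the host has to do.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K	(UINT64_C(1) << 12)
#define SZ_2M	(UINT64_C(1) << 21)
#define SZ_1G	(UINT64_C(1) << 30)

/* Hypothetical helper: ACCEPT operations needed to cover 'bytes'. */
static uint64_t accept_calls(uint64_t bytes, uint64_t granule)
{
	/* Keep the arithmetic exact: only whole granules are counted. */
	assert(granule != 0 && bytes % granule == 0);
	return bytes / granule;
}
```

For a 16GiB guest, accept_calls(16 * SZ_1G, SZ_4K) is 4194304, versus 8192 at
2MiB and 16 at 1GiB, i.e. 512x more operations for each level the guest drops.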

The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
action, not force the hypervisor to limp along in a degraded state.  If the guest
doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
entire 2MiB/1GiB region is available, then the guest can either log a warning and
"poison" the page(s), or terminate and refuse to boot.

If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
appropriate granularity, then IMO that's firmly a guest bug.
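
A minimal sketch of what that looks like from the guest side, assuming the
symmetric size-mismatch rule argued for above and the 1G->2M->4K accept order
mentioned earlier in the thread.  mock_accept() is a stand-in for the TDX
Module, not a real TDCALL wrapper, and all names are hypothetical.

```c
#include <assert.h>

/* Hypothetical names; levels are ordered 4K < 2M < 1G. */
enum level { LEVEL_4K = 1, LEVEL_2M, LEVEL_1G };
enum outcome { ACCEPT_SUCCESS, ACCEPT_SIZE_MISMATCH };

/* Stand-in for the TDX Module: any size mismatch is an error. */
static enum outcome mock_accept(enum level mapping, enum level accept)
{
	return accept == mapping ? ACCEPT_SUCCESS : ACCEPT_SIZE_MISMATCH;
}

/*
 * Try ACCEPT from 'start' down to 4K.  Returns the level that succeeded,
 * or 0 if nothing matched, in which case the guest has to take action
 * itself, e.g. poison the range or refuse to boot.
 */
static int guest_accept(enum level mapping, enum level start)
{
	assert(start >= LEVEL_4K && start <= LEVEL_1G);

	for (int level = start; level >= LEVEL_4K; level--) {
		if (mock_accept(mapping, (enum level)level) == ACCEPT_SUCCESS)
			return level;
	}
	return 0;
}
```

A guest that starts at 1GiB lands on a 2MiB mapping on its second attempt.  A
guest that only ever tries 4KiB gets 0 back and has to deal with it, instead of
the host being asked to demote the mapping.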

If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
memory, then there should be an explicit TDCALL to request that the unwanted
regions of memory be unmapped.  Smushing everything into implicit behavior has
obviously created a giant mess.