Message-ID: <aCrsi1k4y8mGdfv7@yzhao56-desk.sh.intel.com>
Date: Mon, 19 May 2025 16:32:11 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"Hansen, Dave" <dave.hansen@...el.com>, "david@...hat.com"
<david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
"vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
"Du, Fan" <fan.du@...el.com>, "michael.roth@....com" <michael.roth@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"tabba@...gle.com" <tabba@...gle.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Annapurve, Vishal"
<vannapurve@...gle.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Sat, May 17, 2025 at 06:35:57AM +0800, Huang, Kai wrote:
> On Fri, 2025-05-16 at 17:43 +0800, Zhao, Yan Y wrote:
> > On Fri, May 16, 2025 at 09:35:37AM +0800, Huang, Kai wrote:
> > > On Tue, 2025-05-13 at 20:10 +0000, Edgecombe, Rick P wrote:
> > > > > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> > > > > if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> > > > > return PG_LEVEL_4K;
> > > > >
> > > > > - return PG_LEVEL_4K;
> > > > > + return PG_LEVEL_2M;
> > > >
> > > > Maybe combine this with patch 4, or split them into sensible categories.
> > >
> > > How about merge with patch 12
> > >
> > > [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's
> > > ACCEPT level
> > >
> > > instead?
> > >
> > > Per patch 12, the fault due to TDH.MEM.PAGE.ACCEPT contains fault level info, so
> > > KVM should just return that. But it seems we are still returning PG_LEVEL_2M if no
> > > such info is provided (IIUC):
> > Yes, without such info (i.e., tdx->violation_request_level is not set), we always
> > return PG_LEVEL_2M.
> >
> >
> > > int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
> > > gfn_t gfn)
> > > {
> > > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > +
> > > if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
> > > return PG_LEVEL_4K;
> > >
> > > + if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> > > + return tdx->violation_request_level;
> > > +
> > > return PG_LEVEL_2M;
> > > }
> > >
> > > So why not return PG_LEVEL_4K at the end?
> > >
> > > I am asking because below text mentioned in the coverletter:
> > >
> > > A rare case that could lead to splitting in the fault path is when a TD
> > > is configured to receive #VE and accesses memory before the ACCEPT
> > > operation. By the time a vCPU accesses a private GFN, due to the lack
> > > of any guest preferred level, KVM could create a mapping at 2MB level.
> > > If the TD then only performs the ACCEPT operation at 4KB level,
> > > splitting in the fault path will be triggered. However, this is not
> > > regarded as a typical use case, as usually TD always accepts pages in
> > > the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
> > > splitting request is an endless EPT violation. This would not happen
> > > for a Linux guest, which does not expect any #VE.
> > >
> > > Changing to return PG_LEVEL_4K should avoid this problem. It doesn't hurt
> > For TDs that expect #VE, guests access private memory before accepting it.
> > In that case, when KVM receives the EPT violation, there's no expected level from
> > the TDX module. Returning PG_LEVEL_4K at the end basically disables huge pages
> > for those TDs.
>
> Just want to make sure I understand correctly:
>
> Linux TDs always ACCEPT memory first before touching that memory, therefore KVM
> should always be able to get the accept level for Linux TDs.
>
> In other words, returning PG_LEVEL_4K doesn't impact establishing large page
> mapping for Linux TDs.
>
> However, other TDs may choose to touch memory first to receive #VE and then
> accept that memory. Returning PG_LEVEL_2M allows those TDs to use large page
> mappings in SEPT. Otherwise, returning PG_LEVEL_4K essentially disables large
> page for them (since we don't support PROMOTE for now?).
Not only because we don't support PROMOTE.
After KVM maps at 4KB level, if the guest accepts at 2MB level, it would get
a TDACCEPT_SIZE_MISMATCH error.
The case of "KVM maps at 4KB, guest accepts at 2MB" is different from
"KVM maps at 2MB, guest accepts at 4KB".
For the former, the guest can't trigger an endless EPT violation. Consider the
case where the guest wants to accept at 2MB but KVM can't meet its request: if
that could trigger an endless EPT violation, a Linux guest would already trigger
one with today's basic TDX support.
> But in the above text you mentioned that, if doing so, because we choose to
> ignore splitting request on read, returning 2M could result in *endless* EPT
> violation.
I don't get what you mean.
What's the relationship between splitting and "returning 2M could result in
*endless* EPT violation"?
> So to me it seems you choose a design that could bring performance gain for
> certain non-Linux TDs when they follow a certain behaviour but otherwise could
> result in endless EPT violation in KVM.
Also don't understand here.
Which design could result in endless EPT violation?
> I am not sure how is this OK? Or probably I have misunderstanding?
> >
> > Besides, according to Kirill [1], the order from 1GB->2MB->4KB is only the case
> > for linux guests.
> >
> > [1] https://lore.kernel.org/all/6vdj4mfxlyvypn743klxq5twda66tkugwzljdt275rug2gmwwl@zdziylxpre6y/#t
>
> I am not sure how is this related?
>
> On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept order,
> e.g., they always accept 4K, there could be *endless EPT violation* if I
> understand your words correctly.
>
> Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead of
> 2M if no accept level is provided in the fault?
As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
Their ACCEPT operations at sizes > 4KB would get TDACCEPT_SIZE_MISMATCH.
>
> >
> > > normal cases either, since the guest will always do ACCEPT (which contains the
> > > accept level) before accessing the memory.