linux-kernel - Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE

Message-ID: <aC07+s9VvNFCG1ZI@yzhao56-desk.sh.intel.com>
Date: Wed, 21 May 2025 10:35:38 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>
CC: "Shutemov, Kirill" <kirill.shutemov@...el.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "Du, Fan" <fan.du@...el.com>, "Hansen, Dave"
	<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>, "Li,
 Zhiquan1" <zhiquan1.li@...el.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
	"tabba@...gle.com" <tabba@...gle.com>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"Weiny, Ira" <ira.weiny@...el.com>, "michael.roth@....com"
	<michael.roth@....com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, "ackerleytng@...gle.com"
	<ackerleytng@...gle.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
	"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
	"jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "pgonda@...gle.com"
	<pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
 RUNNABLE

On Wed, May 21, 2025 at 07:34:52AM +0800, Huang, Kai wrote:
> On Mon, 2025-05-19 at 16:32 +0800, Zhao, Yan Y wrote:
> > > But in the above text you mentioned that, if doing so, because we choose to
> > > ignore splitting request on read, returning 2M could result in *endless* EPT
> > > violation.
> > I don't get what you mean.
> > What's the relationship between splitting and "returning 2M could result in
> > *endless* EPT" ?
> > 
> > > So to me it seems you choose a design that could bring performance gain for
> > > certain non-Linux TDs when they follow a certain behaviour but otherwise could
> > > result in endless EPT violation in KVM.
> > Also don't understand here.
> > Which design could result in endless EPT violation?
> 
> [Sorry somehow I didn't see your replies yesterday in my mailbox.]
> 
> You mentioned below in your coverletter:
> 
>     (b) with shared kvm->mmu_lock, triggered by fault.
> 
>     ....
> 
>     This series simply ignores the splitting request in the fault path to
>     avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
>     at a lower level would finally figures out the page has been accepted
>     at a higher level by another vCPU.
> 
>     ... The worst outcome to ignore the resulting
>     splitting request is an endless EPT violation. This would not happen
>     for a Linux guest, which does not expect any #VE.
> 
> So to me, IIUC, this means:
> 
>  - this series choose to ignore splitting request when read ..
>  - the worse outcome to ignore the resulting splitting request is an endless
>    EPT violation..
> 
> And this happens exactly in below case:
> 
>  1) Guest touches a 4K page
>  2) KVM AUGs 2M page
>  3) Guest re-accesses that 4K page, and receives #VE
>  4) Guest ACCEPTs that 4K page, this triggers EPT violation
> 
> IIUC, you choose to ignore splitting large page in step 4) (am I right???). 
> Then if guest always ACCEPTs page at 4K level, then KVM will have *endless EPT
> violation*.
> 
> So, is this the "worst outcome to ignore the resulting splitting request" that
> you mentioned in your changelog?
> 
> If it is, then why is it OK?
Initially I assumed the guest would always accept in the sequence
"1G->2M->4K", as the Linux guest does.

If that were true, we could simply ignore the splitting request in the fault
(shared) path, because it would be the guest that does not follow the convention.

However, Kirill and you are right, the guest can accept at 4K.

Given that, the "worst outcome to ignore the resulting splitting request" is not
OK. 

> It is OK *ONLY* when "guest always ACCEPTs 4K page" is a buggy behaviour of the
> guest itself (which KVM is not responsible for).  I.e., the guest is always
> supposed to find the page size that KVM has AUGed upon receiving the #VE (does
> the #VE contain such information?) and then do ACCEPT at that page level.
> 
> Otherwise, if it's a legal behaviour for the guest to always ACCEPT at 4K level,
> then I don't think it's OK to have endless EPT violation in KVM.
We can avoid the endless EPT violation by allowing splitting in the fault
path, though that requires introducing several locks in the TDX code. I had a
POC for that, but we felt it was better to keep the initial support simple.
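
To make the two behaviours concrete, here is a toy userspace model of the
retry loop. It is only a sketch: every name in it (toy_mapping, toy_accept,
fault_policy, ...) is hypothetical and none of it is KVM or TDX module code.
It assumes the page was AUGed at 2M, that the guest retries ACCEPT at one
fixed level, and it uses a fault cap to stand in for "endless". It does not
model the multi-vCPU case where another vCPU accepts at the higher level.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; these are not KVM or TDX module symbols. */
enum map_level { LEVEL_4K = 1, LEVEL_2M = 2 };

enum fault_policy {
	IGNORE_SPLIT,	/* fault path never demotes an existing 2M mapping */
	ALLOW_SPLIT,	/* fault path demotes 2M -> 4K when the guest asks for 4K */
};

struct toy_mapping {
	enum map_level level;	/* level at which the host AUGed the page */
	bool accepted;
};

/*
 * Model one ACCEPT attempt by the guest at accept_level.
 * Returns true when the accept succeeds, false when the levels do not
 * match and the guest would take another EPT violation and retry.
 */
static bool toy_accept(struct toy_mapping *m, enum map_level accept_level,
		       enum fault_policy policy)
{
	if (m->level != accept_level) {
		if (policy == ALLOW_SPLIT && m->level > accept_level)
			m->level = accept_level;	/* split in the fault path */
		return false;				/* guest must retry */
	}
	m->accepted = true;
	return true;
}

/*
 * Retry ACCEPT at a fixed level until it succeeds or max_faults is hit.
 * Returns the number of faults taken before success, or -1 if the cap
 * was reached (the stand-in for an endless EPT violation).
 */
static int toy_accept_loop(enum map_level accept_level,
			   enum fault_policy policy, int max_faults)
{
	struct toy_mapping m = { .level = LEVEL_2M, .accepted = false };
	int faults;

	assert(max_faults > 0);
	for (faults = 0; faults < max_faults; faults++) {
		if (toy_accept(&m, accept_level, policy))
			return faults;
	}
	return -1;
}
```

In this model, toy_accept_loop(LEVEL_4K, IGNORE_SPLIT, n) hits the cap for any
n, while toy_accept_loop(LEVEL_4K, ALLOW_SPLIT, n) succeeds after one fault.
A guest that accepts at 2M succeeds immediately under either policy.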

So, if we all agree not to support huge pages for non-Linux TDs as an initial
step, your proposal is a good way to keep the splitting code simple.