Message-ID: <aFIIsSwv5Si+rG3Z@yzhao56-desk.sh.intel.com>
Date: Wed, 18 Jun 2025 08:30:41 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>, "Du, Fan"
	<fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>, "Huang, Kai"
	<kai.huang@...el.com>, "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
	"Hansen, Dave" <dave.hansen@...el.com>, "david@...hat.com"
	<david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
	"vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
	"Shutemov, Kirill" <kirill.shutemov@...el.com>, "michael.roth@....com"
	<michael.roth@....com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"Peng, Chao P" <chao.p.peng@...el.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Yamahata, Isaku"
	<isaku.yamahata@...el.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
	<ackerleytng@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"Annapurve, Vishal" <vannapurve@...gle.com>, "tabba@...gle.com"
	<tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
	<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
 RUNNABLE

On Tue, Jun 17, 2025 at 08:52:49AM +0800, Yan Zhao wrote:
> On Tue, Jun 17, 2025 at 06:49:00AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2025-06-16 at 11:14 +0800, Yan Zhao wrote:
> > > > Oh, nice. I hadn't seen this. Agree that a comprehensive guest setup is
> > > > quite manual. But here we are playing with guest ABI. In practice, yes
> > > > it's similar to passing yet another arg to get a good TD.
> > > Could we introduce a TD attr TDX_ATTR_SEPT_EXPLICIT_DEMOTION?
> > > 
> > > It can be something similar to TDX_ATTR_SEPT_VE_DISABLE except that we
> > > don't provide a dynamical way as the TDCS_CONFIG_FLEXIBLE_PENDING_VE to
> > > allow guest to turn on/off SEPT_VE_DISABLE.
> > > (See the disable_sept_ve() in ./arch/x86/coco/tdx/tdx.c).
> > > 
> > > So, if userspace configures a TD with TDX_ATTR_SEPT_EXPLICIT_DEMOTION, KVM
> > > first checks if SEPT_EXPLICIT_DEMOTION is supported.
> > > The guest can also check if it would like to support
> > > SEPT_EXPLICIT_DEMOTION to determine to continue or shut down. (If it does
> > > not check SEPT_EXPLICIT_DEMOTION, e.g., if we don't want to update EDK2,
> > > the guest must accept memory before memory accessing).
> > > 
> > > - if TD is configured with SEPT_EXPLICIT_DEMOTION, KVM allows to map at 2MB
> > >   when there's no level info in an EPT violation. The guest must accept
> > >   memory before accessing memory or if it wants to accept only a partial of
> > >   host's mapping, it needs to explicitly invoke a TDVMCALL to request KVM
> > >   to perform page demotion.
> > > 
> > > - if TD is configured without SEPT_EXPLICIT_DEMOTION, KVM always maps at 4KB
> > >   when there's no level info in an EPT violation.
> > > 
> > > - No matter SEPT_EXPLICIT_DEMOTION is configured or not, if there's a level
> > >   info in an EPT violation, while KVM honors the level info as the
> > >   max_level info, KVM ignores the demotion request in the fault path.
Hi Sean,
Could you please confirm whether this matches what you have in mind?
i.e.,

  When an EPT violation carries ACCEPT level info, KVM maps the page at
  map level <= the specified level.
  (If KVM finds a shadow-present leaf SPTE, it will not try to merge/split it.)
  The guest's ACCEPT will succeed, or return PAGE_SIZE_MISMATCH if map level <
  the specified level.
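
To make the rule concrete, here is a minimal C sketch of how the max mapping
level could be chosen per fault. It only models the policy discussed in this
thread and is not KVM code: the sketch_* names, the struct layout and the
explicit_demotion parameter (standing in for the proposed
SEPT_EXPLICIT_DEMOTION attribute) are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical levels for illustration only; not KVM's enum pg_level. */
enum sketch_level {
	SKETCH_LEVEL_4K = 1,
	SKETCH_LEVEL_2M = 2,
};

/* Hypothetical summary of what an EPT violation tells the host. */
struct sketch_fault {
	bool has_accept_level;          /* fault carries ACCEPT level info */
	enum sketch_level accept_level; /* meaningful only if has_accept_level */
};

/*
 * Upper bound on the mapping level for one fault, per the rules in this
 * thread. explicit_demotion models the proposed SEPT_EXPLICIT_DEMOTION
 * TD attribute (an assumption, not an existing flag).
 */
static enum sketch_level sketch_max_map_level(bool explicit_demotion,
					      const struct sketch_fault *fault)
{
	/* Level info, when present, is honored as the max level. */
	if (fault->has_accept_level)
		return fault->accept_level;

	/* No level info: 2MB is allowed only if the TD opted in. */
	return explicit_demotion ? SKETCH_LEVEL_2M : SKETCH_LEVEL_4K;
}
```

The returned value is only an upper bound; the actual map level may still be
lower, e.g. when a shadow-present leaf SPTE already exists at 4KB.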

This keeps Linux guests (with SEPT_VE_DISABLE being true) more efficient.
So, for a Linux guest that only wants to accept at 4KB, the flow is
1. guest ACCEPT 4KB
2. KVM maps it at 4KB
3. ACCEPT 4KB returns success

As the ACCEPT comes before KVM actually maps anything, we can avoid the complex
flow:
1. guest ACCEPT 4KB
2. KVM maps it at 2MB
3. ACCEPT 4KB returns PAGE_SIZE_MISMATCH.
4.(a) guest ACCEPT 2MB or
4.(b) guest triggers TDVMCALL to demote
5. KVM demotes the 2MB mapping
6. guest ACCEPT at 4KB
7. ACCEPT 4KB returns success 
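
The difference between the two flows can be sketched as a toy model in C that
counts guest ACCEPT calls and host demotions. Everything here (the flow_*
names, the host_honors_level switch) is hypothetical and heavily simplified:
it folds the EPT violation into the ACCEPT call, takes path 4.(b) only, and
treats any difference between map level and accept level as a size mismatch.

```c
#include <stdbool.h>

/* Hypothetical levels and status codes for this toy model only. */
enum flow_level { FLOW_UNMAPPED = 0, FLOW_4K = 1, FLOW_2M = 2 };
enum flow_status { FLOW_SUCCESS, FLOW_SIZE_MISMATCH };

struct flow_td {
	enum flow_level mapped; /* current host mapping level */
	int accepts;            /* number of guest ACCEPT calls */
	int demotions;          /* number of host demotions */
};

/*
 * One guest ACCEPT. If nothing is mapped yet, the host maps on the
 * resulting fault: at the ACCEPT level when it honors the level info,
 * otherwise at 2MB.
 */
static enum flow_status flow_accept(struct flow_td *td, enum flow_level level,
				    bool host_honors_level)
{
	td->accepts++;
	if (td->mapped == FLOW_UNMAPPED)
		td->mapped = host_honors_level ? level : FLOW_2M;
	return td->mapped == level ? FLOW_SUCCESS : FLOW_SIZE_MISMATCH;
}

/* Host demotes the 2MB mapping to 4KB (step 5 of the complex flow). */
static void flow_demote(struct flow_td *td)
{
	td->mapped = FLOW_4K;
	td->demotions++;
}

/* Guest wants a 4KB accept; on mismatch it requests demotion and retries. */
static enum flow_status flow_accept_4k(struct flow_td *td,
				       bool host_honors_level)
{
	if (flow_accept(td, FLOW_4K, host_honors_level) == FLOW_SUCCESS)
		return FLOW_SUCCESS;
	flow_demote(td);
	return flow_accept(td, FLOW_4K, host_honors_level);
}
```

With host_honors_level set, a 4KB accept finishes with one ACCEPT and no
demotion; without it, the same accept costs two ACCEPTs plus one demotion.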

For non-Linux guests (with SEPT_VE_DISABLE being false), I totally agree with
your suggestions!

Thanks
Yan

> > I think this is what Sean was suggesting. We are going to need a qemu command
> > line opt-in too.
> > 
> > > 
> > > > We can start with a prototype the host side arg and see how it turns out. I
> > > > realized we need to verify edk2 as well.
> > > Current EDK2 should always accept pages before actual memory access.
> > > So, I think it should be fine.
> > 
> > It's not just that, it needs to handle the accept page size being lower than
> > the mapping size. I went and looked and it is accepting at 4k size in places. It
> As it accepts pages before memory access, the "accept page size being lower than
> the mapping size" can't happen.
> 
> > hopefully is just handling accepting a whole range that is not 2MB aligned. But
> > I think we need to verify this more.
> Ok.
