Message-ID: <aFvBNromdrkEtPp6@yzhao56-desk.sh.intel.com>
Date: Wed, 25 Jun 2025 17:28:22 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "Du, Fan"
<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "tabba@...gle.com"
<tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > Could we provide the info via the private_max_mapping_level hook (i.e. via
> > tdx_gmem_private_max_mapping_level())?
>
> This is one of the previous two methods discussed. Can you elaborate on what you
> are trying to say?
I don't get why we can't use the existing tdx_gmem_private_max_mapping_level()
to convey the max level at which a vendor wants a GFN to be mapped.

Before TDX huge pages, tdx_gmem_private_max_mapping_level() always returns 4KB;
with TDX huge pages, it returns
- 4KB during the TD build stage
- 4KB or 2MB at TD runtime

Why does KVM need to care how the vendor determines this max_level?
I think a vendor should be free to decide based on software limitations, the
guest's wishes, hardware bugs, or whatever.
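
For instance, the hook could be as simple as the sketch below (just a sketch of
what I mean, not the exact RFC code; I'm assuming the TD state is tracked via
to_kvm_tdx(kvm)->state and TD_STATE_RUNNABLE as in the current TDX code):

static int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

        /* During the TD build stage, only 4KB mappings are allowed. */
        if (kvm_tdx->state != TD_STATE_RUNNABLE)
                return PG_LEVEL_4K;

        /*
         * At TD runtime, the vendor side is free to return 2MB, or 4KB
         * based on software limitations, the guest's wishes, hardware
         * bugs, or whatever.
         */
        return PG_LEVEL_2M;
}

KVM would then just clamp the fault's max_level to whatever this hook returns,
without caring how the value was derived.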
> > Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> > private fault?
> >
> > > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> > Something like kvm_lpage_info->disallow_lpage would disallow later page
> > promotion, though we don't support it right now.
>
> Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
> directly, but rely on the logic that checks for mixed attributes. But more
> below...
>
> >
> > > like that. Maybe set an "accepted" attribute, or something. Not sure if it could be
> > Setting an "accepted" attribute in the EPT violation handler?
> > It's a little odd, as the accept operation is not yet completed.
>
> I guess the question in both of these comments is: what is the life cycle? The
> guest could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks
> like TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So
> an accept attribute is not going to work, at least without TDX module changes.
>
>
> Actually, the problem we have doesn't fit the mixed-attributes behavior. If many
> vCPUs accept a 2MB region at 4KB page size, the entire 2MB range could be non-
> mixed and then individual accepts would fail.
>
>
> So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
Set KVM_LPAGE_GUEST_INHIBIT via a TDVMCALL?
Or just set KVM_LPAGE_GUEST_INHIBIT when an EPT violation carries 4KB level
info?

I guess it's the latter, as that avoids modifying both EDK2 and the Linux
guest. I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
2MB" during the boot of a TD with 4GB memory.

But does that mean TDX needs to hold the write mmu_lock in the EPT violation
handler and set KVM_LPAGE_GUEST_INHIBIT upon finding that a violation carries
4KB level info?
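
Something like the sketch below? (All the helper names here are made up, just
to check that I understand the intended locking; kvm_gfn_set_guest_inhibit()
and tdx_demote_page() don't exist.)

static void tdx_inhibit_huge_page(struct kvm *kvm, gfn_t gfn)
{
        struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

        write_lock(&kvm->mmu_lock);
        /*
         * Flag the 2MB range containing @gfn in kvm_lpage_info so the
         * fault handler will never map it at 2MB again. The flag is
         * never cleared based on mixed attributes.
         */
        kvm_gfn_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
        /* Demote any existing 2MB mapping while the write lock is held. */
        tdx_demote_page(kvm, gfn);
        write_unlock(&kvm->mmu_lock);
}

i.e., tdx.c would call tdx_inhibit_huge_page() from the EPT violation handler,
before entering the fault handler, whenever the exit carries a 4KB accept size.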
> based on mixed attributes. It would be one way. It would need to get set by
> something like kvm_write_track_add_gfn() that lives in tdx.c and is called
> before going into the fault handler on 4KB accept size. It would have to take
> the mmu write lock I think, which would kill scalability in the 4KB accept case
> (but not the normal 2MB one). But as long as the mmu write lock is held, demote
> will be no problem, which the operation would also need to do.
>
> I think it actually makes KVM's behavior easier to understand. We don't need to
> worry about races between multiple accept sizes and things like that. It also
> leaves the core MMU code mostly untouched. Performance/scalability-wise, it only
> punishes the rare case.
Let me write down my understanding to check whether it's correct:

- When a TD is NOT configured to support the KVM_LPAGE_GUEST_INHIBIT TDVMCALL,
  KVM always maps at 4KB.
- When a TD is configured to support the KVM_LPAGE_GUEST_INHIBIT TDVMCALL:
  (a)
  1. guest accepts at 4KB
  2. TDX sets KVM_LPAGE_GUEST_INHIBIT and tries splitting (with write mmu_lock
     held)
  3. KVM maps at 4KB (with read mmu_lock held)
  4. guest's 4KB accept succeeds
  (b)
  1. guest accepts at 2MB
  2. KVM maps at 4KB for some reason
  3. guest's 2MB accept fails with TDACCEPT_SIZE_MISMATCH
  4. guest retries the accept at 4KB
  5. guest's 4KB accept succeeds
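
On the guest side, flow (b) would look roughly like the sketch below (modeled
loosely on the accept fallback in the Linux guest's TDX code;
tdcall_accept_page() is a made-up wrapper around TDG.MEM.PAGE.ACCEPT that
returns 0 on success):

static bool guest_accept_2m_range(unsigned long start)
{
        unsigned long gpa;

        /* (b)1: try TDG.MEM.PAGE.ACCEPT at 2MB first. */
        if (!tdcall_accept_page(start, PG_LEVEL_2M))
                return true;

        /*
         * (b)3: KVM mapped the range at 4KB, so the 2MB accept failed
         * with TDACCEPT_SIZE_MISMATCH.
         * (b)4: fall back to accepting each 4KB page individually.
         */
        for (gpa = start; gpa < start + SZ_2M; gpa += SZ_4K) {
                if (tdcall_accept_page(gpa, PG_LEVEL_4K))
                        return false;
        }

        /* (b)5: all 4KB accepts succeeded. */
        return true;
}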
> To leave the option open to promote the GFNs in the future, a GHCI interface
> or similar could be defined for the guest to say "I don't care about page size
> anymore for this gfn". So it won't close it off forever.
ok.